Multi-Class Classification Model for Sign Language MNIST Using Python and Scikit-Learn

David Lowe

November 16, 2020

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery. [https://machinelearningmastery.com/]

SUMMARY: This project aims to construct a predictive model using various machine learning algorithms and to document the end-to-end steps using a template. The Sign Language MNIST dataset presents a multi-class classification problem, where we attempt to predict one of several (more than two) possible outcomes.

INTRODUCTION: The original MNIST image dataset of handwritten digits is a popular benchmark for image-based machine learning methods. The Sign Language MNIST dataset follows the same CSV format, with a label and the pixel values in each row, and is intended to encourage the community to develop more drop-in replacements. The American Sign Language letter database of hand gestures represents a multi-class problem with 24 classes of letters (excluding J and Z, which require motion).

The dataset format is patterned to match closely with the classic MNIST. Each training and test case carries a label (0-25) that maps one-to-one to an alphabetic letter A-Z (with no cases for 9=J or 25=Z because those gestures require motion). The training data (27,455 cases) and test data (7,172 cases) are approximately half the size of the standard MNIST but otherwise similar, with a header row of label, pixel1, pixel2, ..., pixel784. Each row represents a single 28x28 pixel image with grayscale values between 0 and 255. The original hand gesture image data represented multiple users repeating the gesture against different backgrounds.
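The label and pixel layout described above can be illustrated with a short sketch. It assumes the 784 pixel columns are stored in row-major order (pixel1 through pixel28 forming the first image row), which the dataset description does not state explicitly. The helper names label_to_letter and row_to_image are hypothetical and are not part of the original notebook.

```python
import string

IMAGE_SIDE = 28          # each image is 28 x 28 = 784 pixels
ABSENT_LABELS = {9, 25}  # J and Z require motion, so they have no cases


def label_to_letter(label):
    """Map a label in the range 0-25 to its letter A-Z."""
    if not 0 <= label <= 25:
        raise ValueError("label must be between 0 and 25")
    return string.ascii_uppercase[label]


def row_to_image(pixels):
    """Reshape a flat list of 784 values into 28 rows of 28 values.

    Assumption: the pixel columns are laid out in row-major order.
    """
    if len(pixels) != IMAGE_SIDE * IMAGE_SIDE:
        raise ValueError("expected 784 pixel values")
    return [pixels[r * IMAGE_SIDE:(r + 1) * IMAGE_SIDE] for r in range(IMAGE_SIDE)]


# The 24 letters that actually occur in the dataset
PRESENT_LETTERS = [label_to_letter(i) for i in range(26) if i not in ABSENT_LABELS]

# A synthetic row (not real data) just to show the reshaping
image = row_to_image(list(range(IMAGE_SIDE * IMAGE_SIDE)))
print(len(PRESENT_LETTERS), len(image), len(image[0]))  # 24 28 28
```

Label 3 in the first row of the training file would therefore correspond to the letter D.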

ANALYSIS: The machine learning algorithms achieved an average accuracy of 96.38%, which serves as the benchmark. Two algorithms (Extra Trees and Random Forest) produced the top accuracy metrics after the first round of modeling. After a series of tuning trials, the Extra Trees model achieved an accuracy of 99.61%. When configured with the optimized parameters, the Extra Trees model scored 99.83% accuracy on the validation dataset. When we applied the Extra Trees model to the previously unseen test dataset, we obtained an accuracy score of 83.49%.

CONCLUSION: In this iteration, the Extra Trees model appeared to be a suitable algorithm for modeling this dataset. We should consider using the Extra Trees algorithm for further modeling.

Dataset Used: Sign Language MNIST Data Set

Dataset ML Model: Multi-Class classification with numerical attributes

Dataset Reference: https://www.kaggle.com/datamunge/sign-language-mnist

One source of potential performance benchmarks: https://www.kaggle.com/datamunge/sign-language-mnist

A predictive modeling project can generally be broken down into six major tasks:

  1. Prepare Environment
  2. Summarize and Visualize Data
  3. Pre-process Data
  4. Train and Evaluate Models
  5. Fine-tune and Improve Models
  6. Finalize Model and Present Analysis

Task 1 - Prepare Environment

In [1]:
# Install the necessary packages for Colab
# !pip install python-dotenv PyMySQL
In [2]:
# Retrieve the GPU information from Colab
# gpu_info = !nvidia-smi
# gpu_info = '\n'.join(gpu_info)
# if gpu_info.find('failed') >= 0:
#     print('Select the Runtime → "Change runtime type" menu to enable a GPU accelerator, ')
#     print('and then re-execute this cell.')
# else:
#     print(gpu_info)
In [3]:
# Retrieve the memory configuration from Colab
# from psutil import virtual_memory
# ram_gb = virtual_memory().total / 1e9
# print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

# if ram_gb < 20:
#     print('To enable a high-RAM runtime, select the Runtime → "Change runtime type"')
#     print('menu, and then select High-RAM in the Runtime shape dropdown. Then, ')
#     print('re-execute this cell.')
# else:
#     print('You are using a high-RAM runtime!')
In [4]:
# Retrieve the CPU information
ncpu = !nproc
print("The number of available CPUs is:", ncpu[0])
The number of available CPUs is: 4

1.a) Load libraries and modules

In [5]:
# Set the random seed number for reproducible results
seedNum = 888
In [6]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import sys
import math
import boto3
from datetime import datetime
from dotenv import load_dotenv
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
# from sklearn.pipeline import Pipeline
# from sklearn.feature_selection import RFE
# from imblearn.pipeline import Pipeline
# from imblearn.over_sampling import SMOTE
# from imblearn.under_sampling import RandomUnderSampler

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier

1.b) Set up the controlling parameters and functions

In [7]:
# Begin the timer for the script processing
startTimeScript = datetime.now()

# Set up the number of CPU cores available for multi-thread processing
n_jobs = 2

# Set up the flag that controls progress notifications (setting to True will send status messages!)
notifyStatus = False

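# Configure the plotting style (the 'seaborn' style was renamed to 'seaborn-v0_8' in Matplotlib 3.6)
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')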

# Set Pandas options
pd.set_option("display.max_rows", 500)
pd.set_option("display.width", 140)

# Set the percentage sizes for splitting the dataset
test_set_size = 0.2
val_set_size = 0.25

# Set the number of folds for cross validation
n_folds = 5

# Set various default modeling parameters
scoring = 'accuracy'
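To get a feel for the split percentages configured above, the following sketch works out the row counts under the assumption that test_set_size is applied to the full training file first and val_set_size is then applied to what remains. How the notebook actually uses these two settings is not shown in this section, so the two-stage scheme and the helper name two_stage_split_counts are assumptions.

```python
def two_stage_split_counts(n_rows, test_size, val_size):
    """Return (n_train, n_val, n_test) for a hold-out split done in two stages."""
    n_test = round(n_rows * test_size)      # first stage: carve out the test rows
    n_remaining = n_rows - n_test
    n_val = round(n_remaining * val_size)   # second stage: carve out the validation rows
    n_train = n_remaining - n_val
    return n_train, n_val, n_test


n_train, n_val, n_test = two_stage_split_counts(27455, test_size=0.2, val_size=0.25)
print(n_train, n_val, n_test)  # 16473 5491 5491
```

Under that assumption, 0.25 of the remaining 80% equals 20% of the whole, so the two settings would give a 60/20/20 split.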
In [8]:
# Set up the parent directory location for loading the dotenv files
# useColab = True
# if useColab:
#     # Mount Google Drive locally for storing files
#     from google.colab import drive
#     drive.mount('/content/gdrive')
#     gdrivePrefix = '/content/gdrive/My Drive/Colab_Downloads/'
#     env_path = '/content/gdrive/My Drive/Colab Notebooks/'
#     dotenv_path = env_path + "python_script.env"
#     load_dotenv(dotenv_path=dotenv_path)

# Set up the dotenv file for retrieving environment variables
# useLocalPC = True
# if useLocalPC:
#     env_path = "/Users/david/PycharmProjects/"
#     dotenv_path = env_path + "python_script.env"
#     load_dotenv(dotenv_path=dotenv_path)
In [9]:
# Set up the notification function (publishes a status message to an AWS SNS topic)
def status_notify(msg_text):
    access_key = os.environ.get('SNS_ACCESS_KEY')
    secret_key = os.environ.get('SNS_SECRET_KEY')
    aws_region = os.environ.get('SNS_AWS_REGION')
    topic_arn = os.environ.get('SNS_TOPIC_ARN')
    # Abort if any required setting is missing, including the topic ARN used by publish()
    if (access_key is None) or (secret_key is None) or (aws_region is None) or (topic_arn is None):
        sys.exit("Incomplete notification setup info. Script Processing Aborted!!!")
    sns = boto3.client('sns', aws_access_key_id=access_key, aws_secret_access_key=secret_key, region_name=aws_region)
    response = sns.publish(TopicArn=topic_arn, Message=msg_text)
    status_code = response['ResponseMetadata']['HTTPStatusCode']
    if status_code != 200:
        print('Status notification not OK with HTTP status code:', status_code)
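The notification function reads four environment variables before it can publish a message. As a minimal sketch, a small helper could report which of them are missing so that the error message names the culprit. The helper name missing_settings is hypothetical and not part of the original notebook, and it treats empty strings as missing, which is a design choice of this sketch.

```python
import os

REQUIRED_VARS = ("SNS_ACCESS_KEY", "SNS_SECRET_KEY", "SNS_AWS_REGION", "SNS_TOPIC_ARN")


def missing_settings(env=None, required=REQUIRED_VARS):
    """Return the names of required settings that are absent or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example with an explicit dictionary instead of the real environment
example_env = {"SNS_ACCESS_KEY": "abc", "SNS_AWS_REGION": "us-east-1"}
print(missing_settings(example_env))  # ['SNS_SECRET_KEY', 'SNS_TOPIC_ARN']
```

Passing a plain dictionary keeps the check testable without touching the real environment.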
In [10]:
if notifyStatus: status_notify("Task 1 - Prepare Environment has begun! " + datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

1.c) Load dataset

In [11]:
dataset_path = 'https://dainesanalytics.com/datasets/kaggle-sign-language-mnist/sign_mnist_train.csv'
Xy_original = pd.read_csv(dataset_path, sep=',')

# Take a peek at the dataframe after import
Xy_original.head()
Out[11]:
   label  pixel1  pixel2  pixel3  pixel4  pixel5  pixel6  pixel7  pixel8  pixel9  ...  pixel775  pixel776  pixel777  pixel778  pixel779  pixel780  pixel781  pixel782  pixel783  pixel784
0      3     107     118     127     134     139     143     146     150     153  ...       207       207       207       207       206       206       206       204       203       202
1      6     155     157     156     156     156     157     156     158     158  ...        69       149       128        87        94       163       175       103       135       149
2      2     187     188     188     187     187     186     187     188     187  ...       202       201       200       199       198       199       198       195       194       195
3      2     211     211     212     212     211     210     211     210     210  ...       235       234       233       231       230       226       225       222       229       163
4     13     164     167     170     172     176     179     180     184     185  ...        92       105       105       108       133       163       157       163       164       179

5 rows × 785 columns

In [12]:
Xy_original.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27455 entries, 0 to 27454
Data columns (total 785 columns):
 #   Column    Dtype
---  ------    -----
 0   label     int64
 1   pixel1    int64
 2   pixel2    int64
 3   pixel3    int64
 4   pixel4    int64
 5   pixel5    int64
 6   pixel6    int64
 7   pixel7    int64
 8   pixel8    int64
 9   pixel9    int64
 10  pixel10   int64
 11  pixel11   int64
 12  pixel12   int64
 13  pixel13   int64
 14  pixel14   int64
 15  pixel15   int64
 16  pixel16   int64
 17  pixel17   int64
 18  pixel18   int64
 19  pixel19   int64
 20  pixel20   int64
 21  pixel21   int64
 22  pixel22   int64
 23  pixel23   int64
 24  pixel24   int64
 25  pixel25   int64
 26  pixel26   int64
 27  pixel27   int64
 28  pixel28   int64
 29  pixel29   int64
 30  pixel30   int64
 31  pixel31   int64
 32  pixel32   int64
 33  pixel33   int64
 34  pixel34   int64
 35  pixel35   int64
 36  pixel36   int64
 37  pixel37   int64
 38  pixel38   int64
 39  pixel39   int64
 40  pixel40   int64
 41  pixel41   int64
 42  pixel42   int64
 43  pixel43   int64
 44  pixel44   int64
 45  pixel45   int64
 46  pixel46   int64
 47  pixel47   int64
 48  pixel48   int64
 49  pixel49   int64
 50  pixel50   int64
 51  pixel51   int64
 52  pixel52   int64
 53  pixel53   int64
 54  pixel54   int64
 55  pixel55   int64
 56  pixel56   int64
 57  pixel57   int64
 58  pixel58   int64
 59  pixel59   int64
 60  pixel60   int64
 61  pixel61   int64
 62  pixel62   int64
 63  pixel63   int64
 64  pixel64   int64
 65  pixel65   int64
 66  pixel66   int64
 67  pixel67   int64
 68  pixel68   int64
 69  pixel69   int64
 70  pixel70   int64
 71  pixel71   int64
 72  pixel72   int64
 73  pixel73   int64
 74  pixel74   int64
 75  pixel75   int64
 76  pixel76   int64
 77  pixel77   int64
 78  pixel78   int64
 79  pixel79   int64
 80  pixel80   int64
 81  pixel81   int64
 82  pixel82   int64
 83  pixel83   int64
 84  pixel84   int64
 85  pixel85   int64
 86  pixel86   int64
 87  pixel87   int64
 88  pixel88   int64
 89  pixel89   int64
 90  pixel90   int64
 91  pixel91   int64
 92  pixel92   int64
 93  pixel93   int64
 94  pixel94   int64
 95  pixel95   int64
 96  pixel96   int64
 97  pixel97   int64
 98  pixel98   int64
 99  pixel99   int64
 100 pixel100  int64
 101 pixel101  int64
 102 pixel102  int64
 103 pixel103  int64
 104 pixel104  int64
 105 pixel105  int64
 106 pixel106  int64
 107 pixel107  int64
 108 pixel108  int64
 109 pixel109  int64
 110 pixel110  int64
 111 pixel111  int64
 112 pixel112  int64
 113 pixel113  int64
 114 pixel114  int64
 115 pixel115  int64
 116 pixel116  int64
 117 pixel117  int64
 118 pixel118  int64
 119 pixel119  int64
 120 pixel120  int64
 121 pixel121  int64
 122 pixel122  int64
 123 pixel123  int64
 124 pixel124  int64
 125 pixel125  int64
 126 pixel126  int64
 127 pixel127  int64
 128 pixel128  int64
 129 pixel129  int64
 130 pixel130  int64
 131 pixel131  int64
 132 pixel132  int64
 133 pixel133  int64
 134 pixel134  int64
 135 pixel135  int64
 136 pixel136  int64
 137 pixel137  int64
 138 pixel138  int64
 139 pixel139  int64
 140 pixel140  int64
 141 pixel141  int64
 142 pixel142  int64
 143 pixel143  int64
 144 pixel144  int64
 145 pixel145  int64
 146 pixel146  int64
 147 pixel147  int64
 148 pixel148  int64
 149 pixel149  int64
 150 pixel150  int64
 151 pixel151  int64
 152 pixel152  int64
 153 pixel153  int64
 154 pixel154  int64
 155 pixel155  int64
 156 pixel156  int64
 157 pixel157  int64
 158 pixel158  int64
 159 pixel159  int64
 160 pixel160  int64
 161 pixel161  int64
 162 pixel162  int64
 163 pixel163  int64
 164 pixel164  int64
 165 pixel165  int64
 166 pixel166  int64
 167 pixel167  int64
 168 pixel168  int64
 169 pixel169  int64
 170 pixel170  int64
 171 pixel171  int64
 172 pixel172  int64
 173 pixel173  int64
 174 pixel174  int64
 175 pixel175  int64
 176 pixel176  int64
 177 pixel177  int64
 178 pixel178  int64
 179 pixel179  int64
 180 pixel180  int64
 181 pixel181  int64
 182 pixel182  int64
 183 pixel183  int64
 184 pixel184  int64
 185 pixel185  int64
 186 pixel186  int64
 187 pixel187  int64
 188 pixel188  int64
 189 pixel189  int64
 190 pixel190  int64
 191 pixel191  int64
 192 pixel192  int64
 193 pixel193  int64
 194 pixel194  int64
 195 pixel195  int64
 196 pixel196  int64
 197 pixel197  int64
 198 pixel198  int64
 199 pixel199  int64
 200 pixel200  int64
 201 pixel201  int64
 202 pixel202  int64
 203 pixel203  int64
 204 pixel204  int64
 205 pixel205  int64
 206 pixel206  int64
 207 pixel207  int64
 208 pixel208  int64
 209 pixel209  int64
 210 pixel210  int64
 211 pixel211  int64
 212 pixel212  int64
 213 pixel213  int64
 214 pixel214  int64
 215 pixel215  int64
 216 pixel216  int64
 217 pixel217  int64
 218 pixel218  int64
 219 pixel219  int64
 220 pixel220  int64
 221 pixel221  int64
 222 pixel222  int64
 223 pixel223  int64
 224 pixel224  int64
 225 pixel225  int64
 226 pixel226  int64
 227 pixel227  int64
 228 pixel228  int64
 229 pixel229  int64
 230 pixel230  int64
 231 pixel231  int64
 232 pixel232  int64
 233 pixel233  int64
 234 pixel234  int64
 235 pixel235  int64
 236 pixel236  int64
 237 pixel237  int64
 238 pixel238  int64
 239 pixel239  int64
 240 pixel240  int64
 241 pixel241  int64
 242 pixel242  int64
 243 pixel243  int64
 244 pixel244  int64
 245 pixel245  int64
 246 pixel246  int64
 247 pixel247  int64
 248 pixel248  int64
 249 pixel249  int64
 250 pixel250  int64
 251 pixel251  int64
 252 pixel252  int64
 253 pixel253  int64
 254 pixel254  int64
 255 pixel255  int64
 256 pixel256  int64
 257 pixel257  int64
 258 pixel258  int64
 259 pixel259  int64
 260 pixel260  int64
 261 pixel261  int64
 262 pixel262  int64
 263 pixel263  int64
 264 pixel264  int64
 265 pixel265  int64
 266 pixel266  int64
 267 pixel267  int64
 268 pixel268  int64
 269 pixel269  int64
 270 pixel270  int64
 271 pixel271  int64
 272 pixel272  int64
 273 pixel273  int64
 274 pixel274  int64
 275 pixel275  int64
 276 pixel276  int64
 277 pixel277  int64
 278 pixel278  int64
 279 pixel279  int64
 280 pixel280  int64
 281 pixel281  int64
 282 pixel282  int64
 283 pixel283  int64
 284 pixel284  int64
 285 pixel285  int64
 286 pixel286  int64
 287 pixel287  int64
 288 pixel288  int64
 289 pixel289  int64
 290 pixel290  int64
 291 pixel291  int64
 292 pixel292  int64
 293 pixel293  int64
 294 pixel294  int64
 295 pixel295  int64
 296 pixel296  int64
 297 pixel297  int64
 298 pixel298  int64
 299 pixel299  int64
 300 pixel300  int64
 301 pixel301  int64
 302 pixel302  int64
 303 pixel303  int64
 304 pixel304  int64
 305 pixel305  int64
 306 pixel306  int64
 307 pixel307  int64
 308 pixel308  int64
 309 pixel309  int64
 310 pixel310  int64
 311 pixel311  int64
 312 pixel312  int64
 313 pixel313  int64
 314 pixel314  int64
 315 pixel315  int64
 316 pixel316  int64
 317 pixel317  int64
 318 pixel318  int64
 319 pixel319  int64
 320 pixel320  int64
 321 pixel321  int64
 322 pixel322  int64
 323 pixel323  int64
 324 pixel324  int64
 325 pixel325  int64
 326 pixel326  int64
 327 pixel327  int64
 328 pixel328  int64
 329 pixel329  int64
 330 pixel330  int64
 331 pixel331  int64
 332 pixel332  int64
 333 pixel333  int64
 334 pixel334  int64
 335 pixel335  int64
 336 pixel336  int64
 337 pixel337  int64
 338 pixel338  int64
 339 pixel339  int64
 340 pixel340  int64
 341 pixel341  int64
 342 pixel342  int64
 343 pixel343  int64
 344 pixel344  int64
 345 pixel345  int64
 346 pixel346  int64
 347 pixel347  int64
 348 pixel348  int64
 349 pixel349  int64
 350 pixel350  int64
 351 pixel351  int64
 352 pixel352  int64
 353 pixel353  int64
 354 pixel354  int64
 355 pixel355  int64
 356 pixel356  int64
 357 pixel357  int64
 358 pixel358  int64
 359 pixel359  int64
 360 pixel360  int64
 361 pixel361  int64
 362 pixel362  int64
 363 pixel363  int64
 364 pixel364  int64
 365 pixel365  int64
 366 pixel366  int64
 367 pixel367  int64
 368 pixel368  int64
 369 pixel369  int64
 370 pixel370  int64
 371 pixel371  int64
 372 pixel372  int64
 373 pixel373  int64
 374 pixel374  int64
 375 pixel375  int64
 376 pixel376  int64
 377 pixel377  int64
 378 pixel378  int64
 379 pixel379  int64
 380 pixel380  int64
 381 pixel381  int64
 382 pixel382  int64
 383 pixel383  int64
 384 pixel384  int64
 385 pixel385  int64
 386 pixel386  int64
 387 pixel387  int64
 388 pixel388  int64
 389 pixel389  int64
 390 pixel390  int64
 391 pixel391  int64
 392 pixel392  int64
 393 pixel393  int64
 394 pixel394  int64
 395 pixel395  int64
 396 pixel396  int64
 397 pixel397  int64
 398 pixel398  int64
 399 pixel399  int64
 400 pixel400  int64
 401 pixel401  int64
 402 pixel402  int64
 403 pixel403  int64
 404 pixel404  int64
 405 pixel405  int64
 406 pixel406  int64
 407 pixel407  int64
 408 pixel408  int64
 409 pixel409  int64
 410 pixel410  int64
 411 pixel411  int64
 412 pixel412  int64
 413 pixel413  int64
 414 pixel414  int64
 415 pixel415  int64
 416 pixel416  int64
 417 pixel417  int64
 418 pixel418  int64
 419 pixel419  int64
 420 pixel420  int64
 421 pixel421  int64
 422 pixel422  int64
 423 pixel423  int64
 424 pixel424  int64
 425 pixel425  int64
 426 pixel426  int64
 427 pixel427  int64
 428 pixel428  int64
 429 pixel429  int64
 430 pixel430  int64
 431 pixel431  int64
 432 pixel432  int64
 433 pixel433  int64
 434 pixel434  int64
 435 pixel435  int64
 436 pixel436  int64
 437 pixel437  int64
 438 pixel438  int64
 439 pixel439  int64
 440 pixel440  int64
 441 pixel441  int64
 442 pixel442  int64
 443 pixel443  int64
 444 pixel444  int64
 445 pixel445  int64
 446 pixel446  int64
 447 pixel447  int64
 448 pixel448  int64
 449 pixel449  int64
 450 pixel450  int64
 451 pixel451  int64
 452 pixel452  int64
 453 pixel453  int64
 454 pixel454  int64
 455 pixel455  int64
 456 pixel456  int64
 457 pixel457  int64
 458 pixel458  int64
 459 pixel459  int64
 460 pixel460  int64
 461 pixel461  int64
 462 pixel462  int64
 463 pixel463  int64
 464 pixel464  int64
 465 pixel465  int64
 466 pixel466  int64
 467 pixel467  int64
 468 pixel468  int64
 469 pixel469  int64
 470 pixel470  int64
 471 pixel471  int64
 472 pixel472  int64
 473 pixel473  int64
 474 pixel474  int64
 475 pixel475  int64
 476 pixel476  int64
 477 pixel477  int64
 478 pixel478  int64
 479 pixel479  int64
 480 pixel480  int64
 481 pixel481  int64
 482 pixel482  int64
 483 pixel483  int64
 484 pixel484  int64
 485 pixel485  int64
 486 pixel486  int64
 487 pixel487  int64
 488 pixel488  int64
 489 pixel489  int64
 490 pixel490  int64
 491 pixel491  int64
 492 pixel492  int64
 493 pixel493  int64
 494 pixel494  int64
 495 pixel495  int64
 496 pixel496  int64
 497 pixel497  int64
 498 pixel498  int64
 499 pixel499  int64
 500 pixel500  int64
 501 pixel501  int64
 502 pixel502  int64
 503 pixel503  int64
 504 pixel504  int64
 505 pixel505  int64
 506 pixel506  int64
 507 pixel507  int64
 508 pixel508  int64
 509 pixel509  int64
 510 pixel510  int64
 511 pixel511  int64
 512 pixel512  int64
 513 pixel513  int64
 514 pixel514  int64
 515 pixel515  int64
 516 pixel516  int64
 517 pixel517  int64
 518 pixel518  int64
 519 pixel519  int64
 520 pixel520  int64
 521 pixel521  int64
 522 pixel522  int64
 523 pixel523  int64
 524 pixel524  int64
 525 pixel525  int64
 526 pixel526  int64
 527 pixel527  int64
 528 pixel528  int64
 529 pixel529  int64
 530 pixel530  int64
 531 pixel531  int64
 532 pixel532  int64
 533 pixel533  int64
 534 pixel534  int64
 535 pixel535  int64
 536 pixel536  int64
 537 pixel537  int64
 538 pixel538  int64
 539 pixel539  int64
 540 pixel540  int64
 541 pixel541  int64
 542 pixel542  int64
 543 pixel543  int64
 544 pixel544  int64
 545 pixel545  int64
 546 pixel546  int64
 547 pixel547  int64
 548 pixel548  int64
 549 pixel549  int64
 550 pixel550  int64
 551 pixel551  int64
 552 pixel552  int64
 553 pixel553  int64
 554 pixel554  int64
 555 pixel555  int64
 556 pixel556  int64
 557 pixel557  int64
 558 pixel558  int64
 559 pixel559  int64
 560 pixel560  int64
 561 pixel561  int64
 562 pixel562  int64
 563 pixel563  int64
 564 pixel564  int64
 565 pixel565  int64
 566 pixel566  int64
 567 pixel567  int64
 568 pixel568  int64
 569 pixel569  int64
 570 pixel570  int64
 571 pixel571  int64
 572 pixel572  int64
 573 pixel573  int64
 574 pixel574  int64
 575 pixel575  int64
 576 pixel576  int64
 577 pixel577  int64
 578 pixel578  int64
 579 pixel579  int64
 580 pixel580  int64
 581 pixel581  int64
 582 pixel582  int64
 583 pixel583  int64
 584 pixel584  int64
 585 pixel585  int64
 586 pixel586  int64
 587 pixel587  int64
 588 pixel588  int64
 589 pixel589  int64
 590 pixel590  int64
 591 pixel591  int64
 592 pixel592  int64
 593 pixel593  int64
 594 pixel594  int64
 595 pixel595  int64
 596 pixel596  int64
 597 pixel597  int64
 598 pixel598  int64
 599 pixel599  int64
 600 pixel600  int64
 601 pixel601  int64
 602 pixel602  int64
 603 pixel603  int64
 604 pixel604  int64
 605 pixel605  int64
 606 pixel606  int64
 607 pixel607  int64
 608 pixel608  int64
 609 pixel609  int64
 610 pixel610  int64
 611 pixel611  int64
 612 pixel612  int64
 613 pixel613  int64
 614 pixel614  int64
 615 pixel615  int64
 616 pixel616  int64
 617 pixel617  int64
 618 pixel618  int64
 619 pixel619  int64
 620 pixel620  int64
 621 pixel621  int64
 622 pixel622  int64
 623 pixel623  int64
 624 pixel624  int64
 625 pixel625  int64
 626 pixel626  int64
 627 pixel627  int64
 628 pixel628  int64
 629 pixel629  int64
 630 pixel630  int64
 631 pixel631  int64
 632 pixel632  int64
 633 pixel633  int64
 634 pixel634  int64
 635 pixel635  int64
 636 pixel636  int64
 637 pixel637  int64
 638 pixel638  int64
 639 pixel639  int64
 640 pixel640  int64
 641 pixel641  int64
 642 pixel642  int64
 643 pixel643  int64
 644 pixel644  int64
 645 pixel645  int64
 646 pixel646  int64
 647 pixel647  int64
 648 pixel648  int64
 649 pixel649  int64
 650 pixel650  int64
 651 pixel651  int64
 652 pixel652  int64
 653 pixel653  int64
 654 pixel654  int64
 655 pixel655  int64
 656 pixel656  int64
 657 pixel657  int64
 658 pixel658  int64
 659 pixel659  int64
 660 pixel660  int64
 661 pixel661  int64
 662 pixel662  int64
 663 pixel663  int64
 664 pixel664  int64
 665 pixel665  int64
 666 pixel666  int64
 667 pixel667  int64
 668 pixel668  int64
 669 pixel669  int64
 670 pixel670  int64
 671 pixel671  int64
 672 pixel672  int64
 673 pixel673  int64
 674 pixel674  int64
 675 pixel675  int64
 676 pixel676  int64
 677 pixel677  int64
 678 pixel678  int64
 679 pixel679  int64
 680 pixel680  int64
 681 pixel681  int64
 682 pixel682  int64
 683 pixel683  int64
 684 pixel684  int64
 685 pixel685  int64
 686 pixel686  int64
 687 pixel687  int64
 688 pixel688  int64
 689 pixel689  int64
 690 pixel690  int64
 691 pixel691  int64
 692 pixel692  int64
 693 pixel693  int64
 694 pixel694  int64
 695 pixel695  int64
 696 pixel696  int64
 697 pixel697  int64
 698 pixel698  int64
 699 pixel699  int64
 700 pixel700  int64
 701 pixel701  int64
 702 pixel702  int64
 703 pixel703  int64
 704 pixel704  int64
 705 pixel705  int64
 706 pixel706  int64
 707 pixel707  int64
 708 pixel708  int64
 709 pixel709  int64
 710 pixel710  int64
 711 pixel711  int64
 712 pixel712  int64
 713 pixel713  int64
 714 pixel714  int64
 715 pixel715  int64
 716 pixel716  int64
 717 pixel717  int64
 718 pixel718  int64
 719 pixel719  int64
 720 pixel720  int64
 721 pixel721  int64
 722 pixel722  int64
 723 pixel723  int64
 724 pixel724  int64
 725 pixel725  int64
 726 pixel726  int64
 727 pixel727  int64
 728 pixel728  int64
 729 pixel729  int64
 730 pixel730  int64
 731 pixel731  int64
 732 pixel732  int64
 733 pixel733  int64
 734 pixel734  int64
 735 pixel735  int64
 736 pixel736  int64
 737 pixel737  int64
 738 pixel738  int64
 739 pixel739  int64
 740 pixel740  int64
 741 pixel741  int64
 742 pixel742  int64
 743 pixel743  int64
 744 pixel744  int64
 745 pixel745  int64
 746 pixel746  int64
 747 pixel747  int64
 748 pixel748  int64
 749 pixel749  int64
 750 pixel750  int64
 751 pixel751  int64
 752 pixel752  int64
 753 pixel753  int64
 754 pixel754  int64
 755 pixel755  int64
 756 pixel756  int64
 757 pixel757  int64
 758 pixel758  int64
 759 pixel759  int64
 760 pixel760  int64
 761 pixel761  int64
 762 pixel762  int64
 763 pixel763  int64
 764 pixel764  int64
 765 pixel765  int64
 766 pixel766  int64
 767 pixel767  int64
 768 pixel768  int64
 769 pixel769  int64
 770 pixel770  int64
 771 pixel771  int64
 772 pixel772  int64
 773 pixel773  int64
 774 pixel774  int64
 775 pixel775  int64
 776 pixel776  int64
 777 pixel777  int64
 778 pixel778  int64
 779 pixel779  int64
 780 pixel780  int64
 781 pixel781  int64
 782 pixel782  int64
 783 pixel783  int64
 784 pixel784  int64
dtypes: int64(785)
memory usage: 164.4 MB
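The reported memory usage can be checked with simple arithmetic: 27,455 rows times 785 columns times 8 bytes per int64 value. The sketch below assumes that pandas reports sizes in 1024-based units and ignores the small index overhead. It also estimates the footprint if the values were stored as 8-bit unsigned integers, which is a possible optimization and not something the original notebook does.

```python
N_ROWS = 27455
N_COLS = 785
BYTES_PER_INT64 = 8
BYTES_PER_UINT8 = 1
MIB = 1024 ** 2  # assumption: sizes are reported in 1024-based units


def frame_size_mib(n_rows, n_cols, bytes_per_value):
    """Approximate size of a dense numeric table, ignoring index overhead."""
    return n_rows * n_cols * bytes_per_value / MIB


int64_mib = frame_size_mib(N_ROWS, N_COLS, BYTES_PER_INT64)
uint8_mib = frame_size_mib(N_ROWS, N_COLS, BYTES_PER_UINT8)
print(f"int64: {int64_mib:.1f} MB")  # int64: 164.4 MB
print(f"uint8: {uint8_mib:.1f} MB")  # uint8: 20.6 MB
```

Because the labels and grayscale values all lie between 0 and 255, a downcast such as Xy_original.astype('uint8') would be one way to obtain the smaller footprint.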
In [13]:
Xy_original.describe()
Out[13]:
              label        pixel1        pixel2        pixel3        pixel4        pixel5        pixel6        pixel7        pixel8        pixel9  ...      pixel775      pixel776      pixel777      pixel778      pixel779      pixel780      pixel781      pixel782      pixel783      pixel784
count  27455.000000  27455.000000  27455.000000  27455.000000  27455.000000  27455.000000  27455.000000  27455.000000  27455.000000  27455.000000  ...  27455.000000  27455.000000  27455.000000  27455.000000  27455.000000  27455.000000  27455.000000  27455.000000  27455.000000  27455.000000
mean      12.318813    145.419377    148.500273    151.247714    153.546531    156.210891    158.411255    160.472154    162.339683    163.954799  ...    141.104863    147.495611    153.325806    159.125332    161.969259    162.736696    162.906137    161.966454    161.137898    159.824731
std        7.287552     41.358555     39.942152     39.056286     38.595247     37.111165     36.125579     35.016392     33.661998     32.651607  ...     63.751194     65.512894     64.427412     63.708507     63.738316     63.444008     63.509210     63.298721     63.610415     64.396846
min        0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000  ...      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000
25%        6.000000    121.000000    126.000000    130.000000    133.000000    137.000000    140.000000    142.000000    144.000000    146.000000  ...     92.000000     96.000000    103.000000    112.000000    120.000000    125.000000    128.000000    128.000000    128.000000    125.500000
50%       13.000000    150.000000    153.000000    156.000000    158.000000    160.000000    162.000000    164.000000    165.000000    166.000000  ...    144.000000    162.000000    172.000000    180.000000    183.000000    184.000000    184.000000    182.000000    182.000000    182.000000
75%       19.000000    174.000000    176.000000    178.000000    179.000000    181.000000    182.000000    183.000000    184.000000    185.000000  ...    196.000000    202.000000    205.000000    207.000000    208.000000    207.000000    207.000000    206.000000    204.000000    204.000000
max       24.000000    255.000000    255.000000    255.000000    255.000000    255.000000    255.000000    255.000000    255.000000    255.000000  ...    255.000000    255.000000    255.000000    255.000000    255.000000    255.000000    255.000000    255.000000    255.000000    255.000000

8 rows × 785 columns

In [14]:
Xy_original.isnull().sum()
Out[14]:
label       0
pixel1      0
pixel2      0
pixel3      0
pixel4      0
           ..
pixel780    0
pixel781    0
pixel782    0
pixel783    0
pixel784    0
Length: 785, dtype: int64
In [15]:
print('Total number of NaN in the dataframe: ', Xy_original.isnull().sum().sum())
Total number of NaN in the dataframe:  0

1.d) Data Cleaning

In [16]:
# Standardize the class column to the name of targetVar if required
Xy_original = Xy_original.rename(columns={'label': 'targetVar'})

1.e) Splitting Data into Attribute-only and Target-only Sets

In [17]:
# Use variable totCol to hold the number of columns in the dataframe
totCol = len(Xy_original.columns)

# Set up variable totAttr for the total number of attribute columns
totAttr = totCol-1

# targetCol variable indicates the column location of the target/class variable
# If the first column, set targetCol to 1. If the last column, set targetCol to totCol
# If targetCol is neither 1 nor totCol, take care when slicing up the dataframes for visualization
targetCol = 1
In [18]:
# We create attribute-only and target-only datasets (X_original and y_original)
# for various visualization and cleaning/transformation operations

if targetCol == totCol:
    X_original = Xy_original.iloc[:,0:totAttr]
    y_original = Xy_original.iloc[:,totAttr]
else:
    X_original = Xy_original.iloc[:,1:totCol]
    y_original = Xy_original.iloc[:,0]

print("Xy_original.shape: {} X_original.shape: {} y_original.shape: {}".format(Xy_original.shape, X_original.shape, y_original.shape))
Xy_original.shape: (27455, 785) X_original.shape: (27455, 784) y_original.shape: (27455,)
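The slicing above handles a target in the first or last column. As a hedged sketch of a more general approach, the helper below splits a list of column names around a 1-based target position, including a target that sits in the middle. The function name split_column_names is hypothetical and not part of the original notebook.

```python
def split_column_names(columns, target_col):
    """Return (attribute_names, target_name) for a 1-based target position."""
    if not 1 <= target_col <= len(columns):
        raise ValueError("target_col must be between 1 and len(columns)")
    target_idx = target_col - 1
    # Keep every column except the target, preserving the original order
    attributes = columns[:target_idx] + columns[target_idx + 1:]
    return attributes, columns[target_idx]


# Column names matching the layout of the training file after the rename
columns = ["targetVar"] + [f"pixel{i}" for i in range(1, 785)]
attrs, target = split_column_names(columns, target_col=1)
print(len(attrs), target)  # 784 targetVar
```

With pandas, the resulting names could then be used for selection, for example Xy_original[attrs] and Xy_original[target].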

1.f) Set up the parameters for data visualization

In [19]:
# Set up the number of rows and columns for visualization display. dispRow * dispCol should be >= totAttr
dispCol = 4
if totAttr % dispCol == 0 :
    dispRow = totAttr // dispCol
else :
    dispRow = (totAttr // dispCol) + 1
    
# Set figure width to display the data visualization plots
fig_size = plt.rcParams["figure.figsize"]
fig_size[0] = dispCol*4
fig_size[1] = dispRow*4
plt.rcParams["figure.figsize"] = fig_size
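The row calculation above is a ceiling division. The sketch below reproduces it with math.ceil and shows the figure dimensions that result for this dataset. The function name grid_rows is hypothetical, and the factor of 4 inches per cell is taken from the cell above.

```python
import math

INCHES_PER_CELL = 4  # same multiplier as used for fig_size above


def grid_rows(n_plots, n_cols):
    """Number of rows needed to lay out n_plots in a grid with n_cols columns."""
    return math.ceil(n_plots / n_cols)


tot_attr = 784
disp_col = 4
disp_row = grid_rows(tot_attr, disp_col)
fig_width = disp_col * INCHES_PER_CELL
fig_height = disp_row * INCHES_PER_CELL
print(disp_row, fig_width, fig_height)  # 196 16 784
```

With one subplot per attribute, the default figure would be 784 inches tall, so per-attribute plots are likely impractical for image data of this width.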
In [20]:
if notifyStatus: status_notify("Task 1 - Prepare Environment completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

Task 2 - Summarize and Visualize Data

In [21]:
if notifyStatus: status_notify("Task 2 - Summarize and Visualize Data has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

2.a) Descriptive Statistics

2.a.i) Peek at the attribute columns

In [22]:
X_original.head()
Out[22]:
   pixel1  pixel2  pixel3  pixel4  pixel5  pixel6  pixel7  pixel8  pixel9  pixel10  ...  pixel775  pixel776  pixel777  pixel778  pixel779  pixel780  pixel781  pixel782  pixel783  pixel784
0     107     118     127     134     139     143     146     150     153      156  ...       207       207       207       207       206       206       206       204       203       202
1     155     157     156     156     156     157     156     158     158      157  ...        69       149       128        87        94       163       175       103       135       149
2     187     188     188     187     187     186     187     188     187      186  ...       202       201       200       199       198       199       198       195       194       195
3     211     211     212     212     211     210     211     210     210      211  ...       235       234       233       231       230       226       225       222       229       163
4     164     167     170     172     176     179     180     184     185      186  ...        92       105       105       108       133       163       157       163       164       179

5 rows × 784 columns

2.a.ii) Dimensions and attribute types

In [23]:
X_original.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27455 entries, 0 to 27454
Data columns (total 784 columns):
 #   Column    Dtype
---  ------    -----
 0   pixel1    int64
 1   pixel2    int64
 2   pixel3    int64
 3   pixel4    int64
 4   pixel5    int64
 5   pixel6    int64
 6   pixel7    int64
 7   pixel8    int64
 8   pixel9    int64
 9   pixel10   int64
 10  pixel11   int64
 11  pixel12   int64
 12  pixel13   int64
 13  pixel14   int64
 14  pixel15   int64
 15  pixel16   int64
 16  pixel17   int64
 17  pixel18   int64
 18  pixel19   int64
 19  pixel20   int64
 20  pixel21   int64
 21  pixel22   int64
 22  pixel23   int64
 23  pixel24   int64
 24  pixel25   int64
 25  pixel26   int64
 26  pixel27   int64
 27  pixel28   int64
 28  pixel29   int64
 29  pixel30   int64
 30  pixel31   int64
 31  pixel32   int64
 32  pixel33   int64
 33  pixel34   int64
 34  pixel35   int64
 35  pixel36   int64
 36  pixel37   int64
 37  pixel38   int64
 38  pixel39   int64
 39  pixel40   int64
 40  pixel41   int64
 41  pixel42   int64
 42  pixel43   int64
 43  pixel44   int64
 44  pixel45   int64
 45  pixel46   int64
 46  pixel47   int64
 47  pixel48   int64
 48  pixel49   int64
 49  pixel50   int64
 50  pixel51   int64
 51  pixel52   int64
 52  pixel53   int64
 53  pixel54   int64
 54  pixel55   int64
 55  pixel56   int64
 56  pixel57   int64
 57  pixel58   int64
 58  pixel59   int64
 59  pixel60   int64
 60  pixel61   int64
 61  pixel62   int64
 62  pixel63   int64
 63  pixel64   int64
 64  pixel65   int64
 65  pixel66   int64
 66  pixel67   int64
 67  pixel68   int64
 68  pixel69   int64
 69  pixel70   int64
 70  pixel71   int64
 71  pixel72   int64
 72  pixel73   int64
 73  pixel74   int64
 74  pixel75   int64
 75  pixel76   int64
 76  pixel77   int64
 77  pixel78   int64
 78  pixel79   int64
 79  pixel80   int64
 80  pixel81   int64
 81  pixel82   int64
 82  pixel83   int64
 83  pixel84   int64
 84  pixel85   int64
 85  pixel86   int64
 86  pixel87   int64
 87  pixel88   int64
 88  pixel89   int64
 89  pixel90   int64
 90  pixel91   int64
 91  pixel92   int64
 92  pixel93   int64
 93  pixel94   int64
 94  pixel95   int64
 95  pixel96   int64
 96  pixel97   int64
 97  pixel98   int64
 98  pixel99   int64
 99  pixel100  int64
 100 pixel101  int64
 101 pixel102  int64
 102 pixel103  int64
 103 pixel104  int64
 104 pixel105  int64
 105 pixel106  int64
 106 pixel107  int64
 107 pixel108  int64
 108 pixel109  int64
 109 pixel110  int64
 110 pixel111  int64
 111 pixel112  int64
 112 pixel113  int64
 113 pixel114  int64
 114 pixel115  int64
 115 pixel116  int64
 116 pixel117  int64
 117 pixel118  int64
 118 pixel119  int64
 119 pixel120  int64
 120 pixel121  int64
 121 pixel122  int64
 122 pixel123  int64
 123 pixel124  int64
 124 pixel125  int64
 125 pixel126  int64
 126 pixel127  int64
 127 pixel128  int64
 128 pixel129  int64
 129 pixel130  int64
 130 pixel131  int64
 131 pixel132  int64
 132 pixel133  int64
 133 pixel134  int64
 134 pixel135  int64
 135 pixel136  int64
 136 pixel137  int64
 137 pixel138  int64
 138 pixel139  int64
 139 pixel140  int64
 140 pixel141  int64
 141 pixel142  int64
 142 pixel143  int64
 143 pixel144  int64
 144 pixel145  int64
 145 pixel146  int64
 146 pixel147  int64
 147 pixel148  int64
 148 pixel149  int64
 149 pixel150  int64
 150 pixel151  int64
 151 pixel152  int64
 152 pixel153  int64
 153 pixel154  int64
 154 pixel155  int64
 155 pixel156  int64
 156 pixel157  int64
 157 pixel158  int64
 158 pixel159  int64
 159 pixel160  int64
 160 pixel161  int64
 161 pixel162  int64
 162 pixel163  int64
 163 pixel164  int64
 164 pixel165  int64
 165 pixel166  int64
 166 pixel167  int64
 167 pixel168  int64
 168 pixel169  int64
 169 pixel170  int64
 170 pixel171  int64
 171 pixel172  int64
 172 pixel173  int64
 173 pixel174  int64
 174 pixel175  int64
 175 pixel176  int64
 176 pixel177  int64
 177 pixel178  int64
 178 pixel179  int64
 179 pixel180  int64
 180 pixel181  int64
 181 pixel182  int64
 182 pixel183  int64
 183 pixel184  int64
 184 pixel185  int64
 185 pixel186  int64
 186 pixel187  int64
 187 pixel188  int64
 188 pixel189  int64
 189 pixel190  int64
 190 pixel191  int64
 191 pixel192  int64
 192 pixel193  int64
 193 pixel194  int64
 194 pixel195  int64
 195 pixel196  int64
 196 pixel197  int64
 197 pixel198  int64
 198 pixel199  int64
 199 pixel200  int64
 200 pixel201  int64
 201 pixel202  int64
 202 pixel203  int64
 203 pixel204  int64
 204 pixel205  int64
 205 pixel206  int64
 206 pixel207  int64
 207 pixel208  int64
 208 pixel209  int64
 209 pixel210  int64
 210 pixel211  int64
 211 pixel212  int64
 212 pixel213  int64
 213 pixel214  int64
 214 pixel215  int64
 215 pixel216  int64
 216 pixel217  int64
 217 pixel218  int64
 218 pixel219  int64
 219 pixel220  int64
 220 pixel221  int64
 221 pixel222  int64
 222 pixel223  int64
 223 pixel224  int64
 224 pixel225  int64
 225 pixel226  int64
 226 pixel227  int64
 227 pixel228  int64
 228 pixel229  int64
 229 pixel230  int64
 230 pixel231  int64
 231 pixel232  int64
 232 pixel233  int64
 233 pixel234  int64
 234 pixel235  int64
 235 pixel236  int64
 236 pixel237  int64
 237 pixel238  int64
 238 pixel239  int64
 239 pixel240  int64
 240 pixel241  int64
 241 pixel242  int64
 242 pixel243  int64
 243 pixel244  int64
 244 pixel245  int64
 245 pixel246  int64
 246 pixel247  int64
 247 pixel248  int64
 248 pixel249  int64
 249 pixel250  int64
 250 pixel251  int64
 251 pixel252  int64
 252 pixel253  int64
 253 pixel254  int64
 254 pixel255  int64
 255 pixel256  int64
 256 pixel257  int64
 257 pixel258  int64
 258 pixel259  int64
 259 pixel260  int64
 260 pixel261  int64
 261 pixel262  int64
 262 pixel263  int64
 263 pixel264  int64
 264 pixel265  int64
 265 pixel266  int64
 266 pixel267  int64
 267 pixel268  int64
 268 pixel269  int64
 269 pixel270  int64
 270 pixel271  int64
 271 pixel272  int64
 272 pixel273  int64
 273 pixel274  int64
 274 pixel275  int64
 275 pixel276  int64
 276 pixel277  int64
 277 pixel278  int64
 278 pixel279  int64
 279 pixel280  int64
 280 pixel281  int64
 281 pixel282  int64
 282 pixel283  int64
 283 pixel284  int64
 284 pixel285  int64
 285 pixel286  int64
 286 pixel287  int64
 287 pixel288  int64
 288 pixel289  int64
 289 pixel290  int64
 290 pixel291  int64
 291 pixel292  int64
 292 pixel293  int64
 293 pixel294  int64
 294 pixel295  int64
 295 pixel296  int64
 296 pixel297  int64
 297 pixel298  int64
 298 pixel299  int64
 299 pixel300  int64
 300 pixel301  int64
 301 pixel302  int64
 302 pixel303  int64
 303 pixel304  int64
 304 pixel305  int64
 305 pixel306  int64
 306 pixel307  int64
 307 pixel308  int64
 308 pixel309  int64
 309 pixel310  int64
 310 pixel311  int64
 311 pixel312  int64
 312 pixel313  int64
 313 pixel314  int64
 314 pixel315  int64
 315 pixel316  int64
 316 pixel317  int64
 317 pixel318  int64
 318 pixel319  int64
 319 pixel320  int64
 320 pixel321  int64
 321 pixel322  int64
 322 pixel323  int64
 323 pixel324  int64
 324 pixel325  int64
 325 pixel326  int64
 326 pixel327  int64
 327 pixel328  int64
 328 pixel329  int64
 329 pixel330  int64
 330 pixel331  int64
 331 pixel332  int64
 332 pixel333  int64
 333 pixel334  int64
 334 pixel335  int64
 335 pixel336  int64
 336 pixel337  int64
 337 pixel338  int64
 338 pixel339  int64
 339 pixel340  int64
 340 pixel341  int64
 341 pixel342  int64
 342 pixel343  int64
 343 pixel344  int64
 344 pixel345  int64
 345 pixel346  int64
 346 pixel347  int64
 347 pixel348  int64
 348 pixel349  int64
 349 pixel350  int64
 350 pixel351  int64
 351 pixel352  int64
 352 pixel353  int64
 353 pixel354  int64
 354 pixel355  int64
 355 pixel356  int64
 356 pixel357  int64
 357 pixel358  int64
 358 pixel359  int64
 359 pixel360  int64
 360 pixel361  int64
 361 pixel362  int64
 362 pixel363  int64
 363 pixel364  int64
 364 pixel365  int64
 365 pixel366  int64
 366 pixel367  int64
 367 pixel368  int64
 368 pixel369  int64
 369 pixel370  int64
 370 pixel371  int64
 371 pixel372  int64
 372 pixel373  int64
 373 pixel374  int64
 374 pixel375  int64
 375 pixel376  int64
 376 pixel377  int64
 377 pixel378  int64
 378 pixel379  int64
 379 pixel380  int64
 380 pixel381  int64
 381 pixel382  int64
 382 pixel383  int64
 383 pixel384  int64
 384 pixel385  int64
 385 pixel386  int64
 386 pixel387  int64
 387 pixel388  int64
 388 pixel389  int64
 389 pixel390  int64
 390 pixel391  int64
 391 pixel392  int64
 392 pixel393  int64
 393 pixel394  int64
 394 pixel395  int64
 395 pixel396  int64
 396 pixel397  int64
 397 pixel398  int64
 398 pixel399  int64
 399 pixel400  int64
 400 pixel401  int64
 401 pixel402  int64
 402 pixel403  int64
 403 pixel404  int64
 404 pixel405  int64
 405 pixel406  int64
 406 pixel407  int64
 407 pixel408  int64
 408 pixel409  int64
 409 pixel410  int64
 410 pixel411  int64
 411 pixel412  int64
 412 pixel413  int64
 413 pixel414  int64
 414 pixel415  int64
 415 pixel416  int64
 416 pixel417  int64
 417 pixel418  int64
 418 pixel419  int64
 419 pixel420  int64
 420 pixel421  int64
 421 pixel422  int64
 422 pixel423  int64
 423 pixel424  int64
 424 pixel425  int64
 425 pixel426  int64
 426 pixel427  int64
 427 pixel428  int64
 428 pixel429  int64
 429 pixel430  int64
 430 pixel431  int64
 431 pixel432  int64
 432 pixel433  int64
 433 pixel434  int64
 434 pixel435  int64
 435 pixel436  int64
 436 pixel437  int64
 437 pixel438  int64
 438 pixel439  int64
 439 pixel440  int64
 440 pixel441  int64
 441 pixel442  int64
 442 pixel443  int64
 443 pixel444  int64
 444 pixel445  int64
 445 pixel446  int64
 446 pixel447  int64
 447 pixel448  int64
 448 pixel449  int64
 449 pixel450  int64
 450 pixel451  int64
 451 pixel452  int64
 452 pixel453  int64
 453 pixel454  int64
 454 pixel455  int64
 455 pixel456  int64
 456 pixel457  int64
 457 pixel458  int64
 458 pixel459  int64
 459 pixel460  int64
 460 pixel461  int64
 461 pixel462  int64
 462 pixel463  int64
 463 pixel464  int64
 464 pixel465  int64
 465 pixel466  int64
 466 pixel467  int64
 467 pixel468  int64
 468 pixel469  int64
 469 pixel470  int64
 470 pixel471  int64
 471 pixel472  int64
 472 pixel473  int64
 473 pixel474  int64
 474 pixel475  int64
 475 pixel476  int64
 476 pixel477  int64
 477 pixel478  int64
 478 pixel479  int64
 479 pixel480  int64
 480 pixel481  int64
 481 pixel482  int64
 482 pixel483  int64
 483 pixel484  int64
 484 pixel485  int64
 485 pixel486  int64
 486 pixel487  int64
 487 pixel488  int64
 488 pixel489  int64
 489 pixel490  int64
 490 pixel491  int64
 491 pixel492  int64
 492 pixel493  int64
 493 pixel494  int64
 494 pixel495  int64
 495 pixel496  int64
 496 pixel497  int64
 497 pixel498  int64
 498 pixel499  int64
 499 pixel500  int64
 500 pixel501  int64
 501 pixel502  int64
 502 pixel503  int64
 503 pixel504  int64
 504 pixel505  int64
 505 pixel506  int64
 506 pixel507  int64
 507 pixel508  int64
 508 pixel509  int64
 509 pixel510  int64
 510 pixel511  int64
 511 pixel512  int64
 512 pixel513  int64
 513 pixel514  int64
 514 pixel515  int64
 515 pixel516  int64
 516 pixel517  int64
 517 pixel518  int64
 518 pixel519  int64
 519 pixel520  int64
 520 pixel521  int64
 521 pixel522  int64
 522 pixel523  int64
 523 pixel524  int64
 524 pixel525  int64
 525 pixel526  int64
 526 pixel527  int64
 527 pixel528  int64
 528 pixel529  int64
 529 pixel530  int64
 530 pixel531  int64
 531 pixel532  int64
 532 pixel533  int64
 533 pixel534  int64
 534 pixel535  int64
 535 pixel536  int64
 536 pixel537  int64
 537 pixel538  int64
 538 pixel539  int64
 539 pixel540  int64
 540 pixel541  int64
 541 pixel542  int64
 542 pixel543  int64
 543 pixel544  int64
 544 pixel545  int64
 545 pixel546  int64
 546 pixel547  int64
 547 pixel548  int64
 548 pixel549  int64
 549 pixel550  int64
 550 pixel551  int64
 551 pixel552  int64
 552 pixel553  int64
 553 pixel554  int64
 554 pixel555  int64
 555 pixel556  int64
 556 pixel557  int64
 557 pixel558  int64
 558 pixel559  int64
 559 pixel560  int64
 560 pixel561  int64
 561 pixel562  int64
 562 pixel563  int64
 563 pixel564  int64
 564 pixel565  int64
 565 pixel566  int64
 566 pixel567  int64
 567 pixel568  int64
 568 pixel569  int64
 569 pixel570  int64
 570 pixel571  int64
 571 pixel572  int64
 572 pixel573  int64
 573 pixel574  int64
 574 pixel575  int64
 575 pixel576  int64
 576 pixel577  int64
 577 pixel578  int64
 578 pixel579  int64
 579 pixel580  int64
 580 pixel581  int64
 581 pixel582  int64
 582 pixel583  int64
 583 pixel584  int64
 584 pixel585  int64
 585 pixel586  int64
 586 pixel587  int64
 587 pixel588  int64
 588 pixel589  int64
 589 pixel590  int64
 590 pixel591  int64
 591 pixel592  int64
 592 pixel593  int64
 593 pixel594  int64
 594 pixel595  int64
 595 pixel596  int64
 596 pixel597  int64
 597 pixel598  int64
 598 pixel599  int64
 599 pixel600  int64
 600 pixel601  int64
 601 pixel602  int64
 602 pixel603  int64
 603 pixel604  int64
 604 pixel605  int64
 605 pixel606  int64
 606 pixel607  int64
 607 pixel608  int64
 608 pixel609  int64
 609 pixel610  int64
 610 pixel611  int64
 611 pixel612  int64
 612 pixel613  int64
 613 pixel614  int64
 614 pixel615  int64
 615 pixel616  int64
 616 pixel617  int64
 617 pixel618  int64
 618 pixel619  int64
 619 pixel620  int64
 620 pixel621  int64
 621 pixel622  int64
 622 pixel623  int64
 623 pixel624  int64
 624 pixel625  int64
 625 pixel626  int64
 626 pixel627  int64
 627 pixel628  int64
 628 pixel629  int64
 629 pixel630  int64
 630 pixel631  int64
 631 pixel632  int64
 632 pixel633  int64
 633 pixel634  int64
 634 pixel635  int64
 635 pixel636  int64
 636 pixel637  int64
 637 pixel638  int64
 638 pixel639  int64
 639 pixel640  int64
 640 pixel641  int64
 641 pixel642  int64
 642 pixel643  int64
 643 pixel644  int64
 644 pixel645  int64
 645 pixel646  int64
 646 pixel647  int64
 647 pixel648  int64
 648 pixel649  int64
 649 pixel650  int64
 650 pixel651  int64
 651 pixel652  int64
 652 pixel653  int64
 653 pixel654  int64
 654 pixel655  int64
 655 pixel656  int64
 656 pixel657  int64
 657 pixel658  int64
 658 pixel659  int64
 659 pixel660  int64
 660 pixel661  int64
 661 pixel662  int64
 662 pixel663  int64
 663 pixel664  int64
 664 pixel665  int64
 665 pixel666  int64
 666 pixel667  int64
 667 pixel668  int64
 668 pixel669  int64
 669 pixel670  int64
 670 pixel671  int64
 671 pixel672  int64
 672 pixel673  int64
 673 pixel674  int64
 674 pixel675  int64
 675 pixel676  int64
 676 pixel677  int64
 677 pixel678  int64
 678 pixel679  int64
 679 pixel680  int64
 680 pixel681  int64
 681 pixel682  int64
 682 pixel683  int64
 683 pixel684  int64
 684 pixel685  int64
 685 pixel686  int64
 686 pixel687  int64
 687 pixel688  int64
 688 pixel689  int64
 689 pixel690  int64
 690 pixel691  int64
 691 pixel692  int64
 692 pixel693  int64
 693 pixel694  int64
 694 pixel695  int64
 695 pixel696  int64
 696 pixel697  int64
 697 pixel698  int64
 698 pixel699  int64
 699 pixel700  int64
 700 pixel701  int64
 701 pixel702  int64
 702 pixel703  int64
 703 pixel704  int64
 704 pixel705  int64
 705 pixel706  int64
 706 pixel707  int64
 707 pixel708  int64
 708 pixel709  int64
 709 pixel710  int64
 710 pixel711  int64
 711 pixel712  int64
 712 pixel713  int64
 713 pixel714  int64
 714 pixel715  int64
 715 pixel716  int64
 716 pixel717  int64
 717 pixel718  int64
 718 pixel719  int64
 719 pixel720  int64
 720 pixel721  int64
 721 pixel722  int64
 722 pixel723  int64
 723 pixel724  int64
 724 pixel725  int64
 725 pixel726  int64
 726 pixel727  int64
 727 pixel728  int64
 728 pixel729  int64
 729 pixel730  int64
 730 pixel731  int64
 731 pixel732  int64
 732 pixel733  int64
 733 pixel734  int64
 734 pixel735  int64
 735 pixel736  int64
 736 pixel737  int64
 737 pixel738  int64
 738 pixel739  int64
 739 pixel740  int64
 740 pixel741  int64
 741 pixel742  int64
 742 pixel743  int64
 743 pixel744  int64
 744 pixel745  int64
 745 pixel746  int64
 746 pixel747  int64
 747 pixel748  int64
 748 pixel749  int64
 749 pixel750  int64
 750 pixel751  int64
 751 pixel752  int64
 752 pixel753  int64
 753 pixel754  int64
 754 pixel755  int64
 755 pixel756  int64
 756 pixel757  int64
 757 pixel758  int64
 758 pixel759  int64
 759 pixel760  int64
 760 pixel761  int64
 761 pixel762  int64
 762 pixel763  int64
 763 pixel764  int64
 764 pixel765  int64
 765 pixel766  int64
 766 pixel767  int64
 767 pixel768  int64
 768 pixel769  int64
 769 pixel770  int64
 770 pixel771  int64
 771 pixel772  int64
 772 pixel773  int64
 773 pixel774  int64
 774 pixel775  int64
 775 pixel776  int64
 776 pixel777  int64
 777 pixel778  int64
 778 pixel779  int64
 779 pixel780  int64
 780 pixel781  int64
 781 pixel782  int64
 782 pixel783  int64
 783 pixel784  int64
dtypes: int64(784)
memory usage: 164.2 MB
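
The memory figure above follows directly from the storage type: every pixel is held as an 8-byte int64 even though the values only range from 0 to 255. The sketch below is not part of the original notebook. It uses a small synthetic frame (the names X_demo and X_small are hypothetical) to show the arithmetic and how an unsigned 8-bit type would hold the same values in one eighth of the space. It assumes that pandas reports the figure in units of 1024 x 1024 bytes.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for X_original: same column count, fewer rows, values in 0-255.
rng = np.random.default_rng(0)
n_rows, n_cols = 1000, 784
X_demo = pd.DataFrame(
    rng.integers(0, 256, size=(n_rows, n_cols), dtype=np.int64),
    columns=[f"pixel{i}" for i in range(1, n_cols + 1)],
)

# int64 uses 8 bytes per value, so the footprint is rows * columns * 8 bytes.
full_size_mib = 27455 * 784 * 8 / 1024**2
print(f"Expected int64 footprint for 27455 rows: {full_size_mib:.1f} MiB")

# Pixel values fit in one unsigned byte, so uint8 stores the same data in 1/8 of the space.
X_small = X_demo.astype(np.uint8)
bytes_int64 = X_demo.memory_usage(index=False).sum()
bytes_uint8 = X_small.memory_usage(index=False).sum()
print("int64 bytes:", bytes_int64, "uint8 bytes:", bytes_uint8)
```

Note that the saving only applies to the raw data; standardizing the features later converts them to floating point again.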

2.a.iii) Statistical summary of the attributes

In [24]:
X_original.describe()
Out[24]:
pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 pixel8 pixel9 pixel10 ... pixel775 pixel776 pixel777 pixel778 pixel779 pixel780 pixel781 pixel782 pixel783 pixel784
count 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 ... 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000 27455.000000
mean 145.419377 148.500273 151.247714 153.546531 156.210891 158.411255 160.472154 162.339683 163.954799 165.533673 ... 141.104863 147.495611 153.325806 159.125332 161.969259 162.736696 162.906137 161.966454 161.137898 159.824731
std 41.358555 39.942152 39.056286 38.595247 37.111165 36.125579 35.016392 33.661998 32.651607 31.279244 ... 63.751194 65.512894 64.427412 63.708507 63.738316 63.444008 63.509210 63.298721 63.610415 64.396846
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 121.000000 126.000000 130.000000 133.000000 137.000000 140.000000 142.000000 144.000000 146.000000 148.000000 ... 92.000000 96.000000 103.000000 112.000000 120.000000 125.000000 128.000000 128.000000 128.000000 125.500000
50% 150.000000 153.000000 156.000000 158.000000 160.000000 162.000000 164.000000 165.000000 166.000000 167.000000 ... 144.000000 162.000000 172.000000 180.000000 183.000000 184.000000 184.000000 182.000000 182.000000 182.000000
75% 174.000000 176.000000 178.000000 179.000000 181.000000 182.000000 183.000000 184.000000 185.000000 186.000000 ... 196.000000 202.000000 205.000000 207.000000 208.000000 207.000000 207.000000 206.000000 204.000000 204.000000
max 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 ... 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000 255.000000

8 rows × 784 columns

2.a.iv) Summarize the levels of the class attribute

In [25]:
Xy_original.groupby('targetVar').size()
Out[25]:
targetVar
0     1126
1     1010
2     1144
3     1196
4      957
5     1204
6     1090
7     1013
8     1162
10    1114
11    1241
12    1055
13    1151
14    1196
15    1088
16    1279
17    1294
18    1199
19    1186
20    1161
21    1082
22    1225
23    1164
24    1118
dtype: int64
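
As a quick sanity check on the class distribution above, the sketch below (not part of the original notebook) copies the counts from the output, maps each label to its letter, and confirms which labels are absent. The variable names are hypothetical.

```python
import string

# Class counts copied from the groupby output above (label -> number of rows).
class_counts = {
    0: 1126, 1: 1010, 2: 1144, 3: 1196, 4: 957, 5: 1204, 6: 1090, 7: 1013,
    8: 1162, 10: 1114, 11: 1241, 12: 1055, 13: 1151, 14: 1196, 15: 1088,
    16: 1279, 17: 1294, 18: 1199, 19: 1186, 20: 1161, 21: 1082, 22: 1225,
    23: 1164, 24: 1118,
}

# Labels map one-to-one onto A-Z, so label 0 is 'A' and label 24 is 'Y'.
label_to_letter = {label: string.ascii_uppercase[label] for label in class_counts}

# Labels in 0-25 that never occur (the letters that require motion).
missing = sorted(set(range(26)) - set(class_counts))

total = sum(class_counts.values())
smallest = min(class_counts, key=class_counts.get)
largest = max(class_counts, key=class_counts.get)
imbalance_ratio = class_counts[largest] / class_counts[smallest]

print("Total rows:", total)
print("Missing labels:", missing)
print("Smallest class:", label_to_letter[smallest], class_counts[smallest])
print("Largest class:", label_to_letter[largest], class_counts[largest])
print("Largest/smallest ratio: {:.2f}".format(imbalance_ratio))
```

The largest class is only about a third bigger than the smallest, which is consistent with skipping the balancing step later in the template.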

2.b) Data Visualization

In [26]:
# Histograms for each attribute
X_original.hist(layout=(dispRow,dispCol))
plt.show()
In [27]:
# Box and Whisker plot for each attribute
X_original.plot(kind='box', subplots=True, layout=(dispRow,dispCol))
plt.show()
In [28]:
# Correlation matrix
fig = plt.figure(figsize=(16,12))
ax = fig.add_subplot(111)
correlations = X_original.corr(method='pearson')
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
plt.show()
In [29]:
if notifyStatus: status_notify("Task 2 - Summarize and Visualize Data completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

Task 3 - Pre-process Data

In [30]:
if notifyStatus: status_notify("Task 3 - Pre-process Data has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

3.a) Splitting Data into Training and Validation Sets

In [31]:
# Split the data further into training and validation datasets
X_train_df, X_validation_df, y_train_df, y_validation_df = train_test_split(X_original, y_original, test_size=val_set_size, stratify=y_original, random_state=seedNum)
print("X_train_df.shape: {} y_train_df.shape: {}".format(X_train_df.shape, y_train_df.shape))
print("X_validation_df.shape: {} y_validation_df.shape: {}".format(X_validation_df.shape, y_validation_df.shape))
X_train_df.shape: (20591, 784) y_train_df.shape: (20591,)
X_validation_df.shape: (6864, 784) y_validation_df.shape: (6864,)
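
The shapes above are consistent with a 25% hold-out, although the value of val_set_size is set earlier in the notebook and is only assumed here. The sketch below, which is not part of the original notebook, rebuilds a label vector from the class counts and repeats the stratified split on a placeholder feature column. The random_state value is arbitrary and all names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Class sizes from section 2.a.iv; label 9 (J) has no cases.
labels = [label for label in range(25) if label != 9]
counts = [1126, 1010, 1144, 1196, 957, 1204, 1090, 1013, 1162, 1114, 1241, 1055,
          1151, 1196, 1088, 1279, 1294, 1199, 1186, 1161, 1082, 1225, 1164, 1118]

# Placeholder data: one dummy feature per row, labels repeated per class size.
y_demo = np.repeat(labels, counts)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

val_fraction = 0.25  # assumed value of val_set_size
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=val_fraction, stratify=y_demo, random_state=888
)
print("train:", X_tr.shape, "validation:", X_val.shape)

# With stratify, each class keeps roughly the same share in the validation set.
for label, count in zip(labels, counts):
    share = (y_val == label).sum() / count
    print(label, round(share, 3))
```

Stratifying on the target keeps all 24 letters represented in both partitions, which matters when per-class accuracy is compared later.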

3.b) Feature Scaling and Data Pre-Processing

In [33]:
# Apply feature scaling and transformation
columns_to_scale = X_train_df.columns[X_train_df.dtypes == 'int'].tolist()
print('Columns to scale are:', columns_to_scale)
scaler = preprocessing.StandardScaler()
X_train_df[columns_to_scale] = scaler.fit_transform(X_train_df[columns_to_scale])
print(X_train_df.head())
Columns to scale are: ['pixel1', 'pixel2', 'pixel3', 'pixel4', 'pixel5', 'pixel6', 'pixel7', 'pixel8', 'pixel9', 'pixel10', 'pixel11', 'pixel12', 'pixel13', 'pixel14', 'pixel15', 'pixel16', 'pixel17', 'pixel18', 'pixel19', 'pixel20', 'pixel21', 'pixel22', 'pixel23', 'pixel24', 'pixel25', 'pixel26', 'pixel27', 'pixel28', 'pixel29', 'pixel30', 'pixel31', 'pixel32', 'pixel33', 'pixel34', 'pixel35', 'pixel36', 'pixel37', 'pixel38', 'pixel39', 'pixel40', 'pixel41', 'pixel42', 'pixel43', 'pixel44', 'pixel45', 'pixel46', 'pixel47', 'pixel48', 'pixel49', 'pixel50', 'pixel51', 'pixel52', 'pixel53', 'pixel54', 'pixel55', 'pixel56', 'pixel57', 'pixel58', 'pixel59', 'pixel60', 'pixel61', 'pixel62', 'pixel63', 'pixel64', 'pixel65', 'pixel66', 'pixel67', 'pixel68', 'pixel69', 'pixel70', 'pixel71', 'pixel72', 'pixel73', 'pixel74', 'pixel75', 'pixel76', 'pixel77', 'pixel78', 'pixel79', 'pixel80', 'pixel81', 'pixel82', 'pixel83', 'pixel84', 'pixel85', 'pixel86', 'pixel87', 'pixel88', 'pixel89', 'pixel90', 'pixel91', 'pixel92', 'pixel93', 'pixel94', 'pixel95', 'pixel96', 'pixel97', 'pixel98', 'pixel99', 'pixel100', 'pixel101', 'pixel102', 'pixel103', 'pixel104', 'pixel105', 'pixel106', 'pixel107', 'pixel108', 'pixel109', 'pixel110', 'pixel111', 'pixel112', 'pixel113', 'pixel114', 'pixel115', 'pixel116', 'pixel117', 'pixel118', 'pixel119', 'pixel120', 'pixel121', 'pixel122', 'pixel123', 'pixel124', 'pixel125', 'pixel126', 'pixel127', 'pixel128', 'pixel129', 'pixel130', 'pixel131', 'pixel132', 'pixel133', 'pixel134', 'pixel135', 'pixel136', 'pixel137', 'pixel138', 'pixel139', 'pixel140', 'pixel141', 'pixel142', 'pixel143', 'pixel144', 'pixel145', 'pixel146', 'pixel147', 'pixel148', 'pixel149', 'pixel150', 'pixel151', 'pixel152', 'pixel153', 'pixel154', 'pixel155', 'pixel156', 'pixel157', 'pixel158', 'pixel159', 'pixel160', 'pixel161', 'pixel162', 'pixel163', 'pixel164', 'pixel165', 'pixel166', 'pixel167', 'pixel168', 'pixel169', 'pixel170', 'pixel171', 'pixel172', 'pixel173', 
'pixel174', 'pixel175', 'pixel176', 'pixel177', 'pixel178', 'pixel179', 'pixel180', 'pixel181', 'pixel182', 'pixel183', 'pixel184', 'pixel185', 'pixel186', 'pixel187', 'pixel188', 'pixel189', 'pixel190', 'pixel191', 'pixel192', 'pixel193', 'pixel194', 'pixel195', 'pixel196', 'pixel197', 'pixel198', 'pixel199', 'pixel200', 'pixel201', 'pixel202', 'pixel203', 'pixel204', 'pixel205', 'pixel206', 'pixel207', 'pixel208', 'pixel209', 'pixel210', 'pixel211', 'pixel212', 'pixel213', 'pixel214', 'pixel215', 'pixel216', 'pixel217', 'pixel218', 'pixel219', 'pixel220', 'pixel221', 'pixel222', 'pixel223', 'pixel224', 'pixel225', 'pixel226', 'pixel227', 'pixel228', 'pixel229', 'pixel230', 'pixel231', 'pixel232', 'pixel233', 'pixel234', 'pixel235', 'pixel236', 'pixel237', 'pixel238', 'pixel239', 'pixel240', 'pixel241', 'pixel242', 'pixel243', 'pixel244', 'pixel245', 'pixel246', 'pixel247', 'pixel248', 'pixel249', 'pixel250', 'pixel251', 'pixel252', 'pixel253', 'pixel254', 'pixel255', 'pixel256', 'pixel257', 'pixel258', 'pixel259', 'pixel260', 'pixel261', 'pixel262', 'pixel263', 'pixel264', 'pixel265', 'pixel266', 'pixel267', 'pixel268', 'pixel269', 'pixel270', 'pixel271', 'pixel272', 'pixel273', 'pixel274', 'pixel275', 'pixel276', 'pixel277', 'pixel278', 'pixel279', 'pixel280', 'pixel281', 'pixel282', 'pixel283', 'pixel284', 'pixel285', 'pixel286', 'pixel287', 'pixel288', 'pixel289', 'pixel290', 'pixel291', 'pixel292', 'pixel293', 'pixel294', 'pixel295', 'pixel296', 'pixel297', 'pixel298', 'pixel299', 'pixel300', 'pixel301', 'pixel302', 'pixel303', 'pixel304', 'pixel305', 'pixel306', 'pixel307', 'pixel308', 'pixel309', 'pixel310', 'pixel311', 'pixel312', 'pixel313', 'pixel314', 'pixel315', 'pixel316', 'pixel317', 'pixel318', 'pixel319', 'pixel320', 'pixel321', 'pixel322', 'pixel323', 'pixel324', 'pixel325', 'pixel326', 'pixel327', 'pixel328', 'pixel329', 'pixel330', 'pixel331', 'pixel332', 'pixel333', 'pixel334', 'pixel335', 'pixel336', 'pixel337', 'pixel338', 'pixel339', 
'pixel340', 'pixel341', 'pixel342', 'pixel343', 'pixel344', 'pixel345', 'pixel346', 'pixel347', 'pixel348', 'pixel349', 'pixel350', 'pixel351', 'pixel352', 'pixel353', 'pixel354', 'pixel355', 'pixel356', 'pixel357', 'pixel358', 'pixel359', 'pixel360', 'pixel361', 'pixel362', 'pixel363', 'pixel364', 'pixel365', 'pixel366', 'pixel367', 'pixel368', 'pixel369', 'pixel370', 'pixel371', 'pixel372', 'pixel373', 'pixel374', 'pixel375', 'pixel376', 'pixel377', 'pixel378', 'pixel379', 'pixel380', 'pixel381', 'pixel382', 'pixel383', 'pixel384', 'pixel385', 'pixel386', 'pixel387', 'pixel388', 'pixel389', 'pixel390', 'pixel391', 'pixel392', 'pixel393', 'pixel394', 'pixel395', 'pixel396', 'pixel397', 'pixel398', 'pixel399', 'pixel400', 'pixel401', 'pixel402', 'pixel403', 'pixel404', 'pixel405', 'pixel406', 'pixel407', 'pixel408', 'pixel409', 'pixel410', 'pixel411', 'pixel412', 'pixel413', 'pixel414', 'pixel415', 'pixel416', 'pixel417', 'pixel418', 'pixel419', 'pixel420', 'pixel421', 'pixel422', 'pixel423', 'pixel424', 'pixel425', 'pixel426', 'pixel427', 'pixel428', 'pixel429', 'pixel430', 'pixel431', 'pixel432', 'pixel433', 'pixel434', 'pixel435', 'pixel436', 'pixel437', 'pixel438', 'pixel439', 'pixel440', 'pixel441', 'pixel442', 'pixel443', 'pixel444', 'pixel445', 'pixel446', 'pixel447', 'pixel448', 'pixel449', 'pixel450', 'pixel451', 'pixel452', 'pixel453', 'pixel454', 'pixel455', 'pixel456', 'pixel457', 'pixel458', 'pixel459', 'pixel460', 'pixel461', 'pixel462', 'pixel463', 'pixel464', 'pixel465', 'pixel466', 'pixel467', 'pixel468', 'pixel469', 'pixel470', 'pixel471', 'pixel472', 'pixel473', 'pixel474', 'pixel475', 'pixel476', 'pixel477', 'pixel478', 'pixel479', 'pixel480', 'pixel481', 'pixel482', 'pixel483', 'pixel484', 'pixel485', 'pixel486', 'pixel487', 'pixel488', 'pixel489', 'pixel490', 'pixel491', 'pixel492', 'pixel493', 'pixel494', 'pixel495', 'pixel496', 'pixel497', 'pixel498', 'pixel499', 'pixel500', 'pixel501', 'pixel502', 'pixel503', 'pixel504', 'pixel505', 
'pixel506', 'pixel507', 'pixel508', 'pixel509', 'pixel510', 'pixel511', 'pixel512', 'pixel513', 'pixel514', 'pixel515', 'pixel516', 'pixel517', 'pixel518', 'pixel519', 'pixel520', 'pixel521', 'pixel522', 'pixel523', 'pixel524', 'pixel525', 'pixel526', 'pixel527', 'pixel528', 'pixel529', 'pixel530', 'pixel531', 'pixel532', 'pixel533', 'pixel534', 'pixel535', 'pixel536', 'pixel537', 'pixel538', 'pixel539', 'pixel540', 'pixel541', 'pixel542', 'pixel543', 'pixel544', 'pixel545', 'pixel546', 'pixel547', 'pixel548', 'pixel549', 'pixel550', 'pixel551', 'pixel552', 'pixel553', 'pixel554', 'pixel555', 'pixel556', 'pixel557', 'pixel558', 'pixel559', 'pixel560', 'pixel561', 'pixel562', 'pixel563', 'pixel564', 'pixel565', 'pixel566', 'pixel567', 'pixel568', 'pixel569', 'pixel570', 'pixel571', 'pixel572', 'pixel573', 'pixel574', 'pixel575', 'pixel576', 'pixel577', 'pixel578', 'pixel579', 'pixel580', 'pixel581', 'pixel582', 'pixel583', 'pixel584', 'pixel585', 'pixel586', 'pixel587', 'pixel588', 'pixel589', 'pixel590', 'pixel591', 'pixel592', 'pixel593', 'pixel594', 'pixel595', 'pixel596', 'pixel597', 'pixel598', 'pixel599', 'pixel600', 'pixel601', 'pixel602', 'pixel603', 'pixel604', 'pixel605', 'pixel606', 'pixel607', 'pixel608', 'pixel609', 'pixel610', 'pixel611', 'pixel612', 'pixel613', 'pixel614', 'pixel615', 'pixel616', 'pixel617', 'pixel618', 'pixel619', 'pixel620', 'pixel621', 'pixel622', 'pixel623', 'pixel624', 'pixel625', 'pixel626', 'pixel627', 'pixel628', 'pixel629', 'pixel630', 'pixel631', 'pixel632', 'pixel633', 'pixel634', 'pixel635', 'pixel636', 'pixel637', 'pixel638', 'pixel639', 'pixel640', 'pixel641', 'pixel642', 'pixel643', 'pixel644', 'pixel645', 'pixel646', 'pixel647', 'pixel648', 'pixel649', 'pixel650', 'pixel651', 'pixel652', 'pixel653', 'pixel654', 'pixel655', 'pixel656', 'pixel657', 'pixel658', 'pixel659', 'pixel660', 'pixel661', 'pixel662', 'pixel663', 'pixel664', 'pixel665', 'pixel666', 'pixel667', 'pixel668', 'pixel669', 'pixel670', 'pixel671', 
'pixel672', 'pixel673', 'pixel674', 'pixel675', 'pixel676', 'pixel677', 'pixel678', 'pixel679', 'pixel680', 'pixel681', 'pixel682', 'pixel683', 'pixel684', 'pixel685', 'pixel686', 'pixel687', 'pixel688', 'pixel689', 'pixel690', 'pixel691', 'pixel692', 'pixel693', 'pixel694', 'pixel695', 'pixel696', 'pixel697', 'pixel698', 'pixel699', 'pixel700', 'pixel701', 'pixel702', 'pixel703', 'pixel704', 'pixel705', 'pixel706', 'pixel707', 'pixel708', 'pixel709', 'pixel710', 'pixel711', 'pixel712', 'pixel713', 'pixel714', 'pixel715', 'pixel716', 'pixel717', 'pixel718', 'pixel719', 'pixel720', 'pixel721', 'pixel722', 'pixel723', 'pixel724', 'pixel725', 'pixel726', 'pixel727', 'pixel728', 'pixel729', 'pixel730', 'pixel731', 'pixel732', 'pixel733', 'pixel734', 'pixel735', 'pixel736', 'pixel737', 'pixel738', 'pixel739', 'pixel740', 'pixel741', 'pixel742', 'pixel743', 'pixel744', 'pixel745', 'pixel746', 'pixel747', 'pixel748', 'pixel749', 'pixel750', 'pixel751', 'pixel752', 'pixel753', 'pixel754', 'pixel755', 'pixel756', 'pixel757', 'pixel758', 'pixel759', 'pixel760', 'pixel761', 'pixel762', 'pixel763', 'pixel764', 'pixel765', 'pixel766', 'pixel767', 'pixel768', 'pixel769', 'pixel770', 'pixel771', 'pixel772', 'pixel773', 'pixel774', 'pixel775', 'pixel776', 'pixel777', 'pixel778', 'pixel779', 'pixel780', 'pixel781', 'pixel782', 'pixel783', 'pixel784']
/usr/local/lib/python3.7/site-packages/ipykernel_launcher.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.
/usr/local/lib/python3.7/site-packages/pandas/core/indexing.py:1736: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  isetter(loc, value[:, i].tolist())
         pixel1    pixel2    pixel3    pixel4    pixel5    pixel6    pixel7    pixel8    pixel9   pixel10  ...  pixel775  pixel776  \
14204  0.687077  0.659519  0.654681  0.628471  0.635513  0.620686  0.553037  0.549104  0.517285  0.489673  ...  1.194075  1.095422   
19315  0.614704  0.559445  0.501567  0.448431  0.421781  0.373088  0.325997  0.283760  0.243629  0.204057  ... -0.875268 -1.302054   
9314   0.759449  0.759592  0.731238  0.679911  0.662229  0.620686  0.581417  0.549104  0.517285  0.489673  ... -0.734176 -0.874479   
20570 -0.229642 -0.266163 -0.315045 -0.348890 -0.433149 -0.479750 -0.525403 -0.571240 -0.607745 -0.716263  ... -2.207799 -1.867064   
23966 -0.639753 -0.591402 -0.570236 -0.554650 -0.566732 -0.589793 -0.582163 -0.600723 -0.607745 -0.652792  ...  0.190757  0.591494   

       pixel777  pixel778  pixel779  pixel780  pixel781  pixel782  pixel783  pixel784  
14204  1.021090  0.940411  0.878424  0.869080  0.863864  0.882087  0.873285  0.883140  
19315 -0.762573  0.438578  0.392316  0.348981  0.344565  0.328318  0.307443  0.308575  
9314  -0.762573  0.924729  1.395894  1.247333  1.131382  1.214349  1.140488  1.224774  
20570 -1.600119 -1.443295 -1.912776 -1.983582 -1.748368 -1.538675 -1.688724 -1.989688  
23966  0.338645  0.250391  0.188465  0.159854  0.139992  0.122632  0.103111  0.106700  

[5 rows x 784 columns]
In [34]:
# Histograms for each attribute after pre-processing
X_train_df[columns_to_scale].hist(layout=(dispRow,dispCol))
plt.show()
In [35]:
# Apply feature scaling and transformation to the validation dataset
scaled_features = scaler.transform(X_validation_df[columns_to_scale])
X_validation_df.loc[:, columns_to_scale] = scaled_features
print(X_validation_df.head())
/usr/local/lib/python3.7/site-packages/pandas/core/indexing.py:1736: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  isetter(loc, value[:, i].tolist())
         pixel1    pixel2    pixel3    pixel4    pixel5    pixel6    pixel7    pixel8    pixel9   pixel10  ...  pixel775  pixel776  \
9810  -0.760373 -0.741513 -0.697831 -0.631810 -0.593448 -0.534771 -0.610543 -0.630206 -0.607745 -0.621057  ... -1.392604  1.003799   
21941 -0.229642 -0.291181 -0.340564 -0.323170 -0.352999 -0.397217 -0.440263 -0.512275 -0.486120 -0.525852  ... -0.796884  0.072295   
17847 -0.374387 -0.441292 -0.468159 -0.477490 -0.540015 -0.562282 -0.582163 -0.630206 -0.698963 -0.747998  ...  0.128050  0.225001   
21970 -0.977491 -2.267637 -1.157175 -0.940451 -2.383457 -3.946121 -3.335023 -2.723482 -2.736180 -2.810782  ... -0.185487 -0.233116   
3586  -0.302014 -0.266163 -0.187449 -0.091690 -0.059117 -0.012065  0.013817  0.047897  0.061192  0.140587  ... -0.075749 -0.767585   

       pixel777  pixel778  pixel779  pixel780  pixel781  pixel782  pixel783  pixel784  
9810   0.648847  0.595401  0.533445  0.506587  0.454719  0.454894  0.448903  0.448334  
21941  0.431705  0.344485  0.298231  0.301699  0.297356  0.328318  0.354596  0.370690  
17847  0.105993 -0.000526 -0.046749 -0.092315 -0.080316  0.011879 -0.038350 -0.141761  
21970 -0.685022 -0.831686 -1.081688 -1.116751 -1.055969 -1.032372 -0.934267 -0.964787  
3586  -1.879301 -1.600117  0.502083  1.042446  0.974018  0.992841  1.014746  1.022899  

[5 rows x 784 columns]
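
The cell above reuses the scaler that was fitted on the training split, so the validation data is transformed with training-set statistics only. The sketch below illustrates that idea in plain Python on toy numbers. It assumes a standard (z-score) scaler, which this section does not state; `fit_standardizer` and `standardize` are hypothetical helper names, not part of the notebook.

```python
from statistics import fmean, pstdev

def fit_standardizer(train_column):
    """Return (mean, std) computed from the training values only."""
    return fmean(train_column), pstdev(train_column)

def standardize(values, mean, std):
    """Apply z-score scaling with statistics learned elsewhere."""
    return [(v - mean) / std for v in values]

# Toy grayscale values for one pixel column (illustrative, not from the dataset)
train_pixel = [100, 120, 140, 160, 180]
validation_pixel = [90, 140, 200]

# Statistics come from the training column and are reused for validation
mean, std = fit_standardizer(train_pixel)
train_scaled = standardize(train_pixel, mean, std)
validation_scaled = standardize(validation_pixel, mean, std)

print("training mean/std:", mean, round(std, 4))
print("scaled validation:", [round(v, 4) for v in validation_scaled])
```

The scaled training column has zero mean and unit variance by construction. The scaled validation column generally does not, which is expected when the statistics come from a different split.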

3.c) Training Data Balancing

In [36]:
# Not applicable for this iteration of the project

3.d) Feature Selection

In [37]:
# Not applicable for this iteration of the project

3.e) Display the Final Datasets for Model-Building

In [38]:
# Finalize the training and validation datasets for the modeling activities
X_train = X_train_df.to_numpy()
y_train = y_train_df.to_numpy()
X_validation = X_validation_df.to_numpy()
y_validation = y_validation_df.to_numpy()
print("X_train.shape: {} y_train.shape: {}".format(X_train.shape, y_train.shape))
print("X_validation.shape: {} y_validation.shape: {}".format(X_validation.shape, y_validation.shape))
X_train.shape: (20591, 784) y_train.shape: (20591,)
X_validation.shape: (6864, 784) y_validation.shape: (6864,)
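
The shapes printed above are consistent with a 25% hold-out taken from the 27,455 training cases. The snippet below checks the arithmetic. The 0.25 fraction is an assumption inferred from the numbers, since the split call is not shown in this section.

```python
import math

n_total = 27455              # training cases reported in the introduction
validation_fraction = 0.25   # assumed; inferred from the printed shapes

# Round the hold-out size up and give the remainder to the training split
n_validation = math.ceil(n_total * validation_fraction)
n_train = n_total - n_validation

print(n_train, n_validation)
```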
In [39]:
if notifyStatus: status_notify("Task 3 - Pre-process Data completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

Task 4 - Train and Evaluate Models

In [40]:
if notifyStatus: status_notify("Task 4 - Train and Evaluate Models has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

4.a) Set test options and evaluation metric

In [41]:
# Set up Algorithms Spot-Checking Array
startTimeTraining = datetime.now()
startTimeModule = datetime.now()
train_models = []
train_results = []
train_model_names = []
train_metrics = []
train_models.append(('LDA', LinearDiscriminantAnalysis()))
train_models.append(('CART', DecisionTreeClassifier(random_state=seedNum)))
train_models.append(('KNN', KNeighborsClassifier(n_jobs=n_jobs)))
train_models.append(('BGT', BaggingClassifier(random_state=seedNum, n_jobs=n_jobs)))
train_models.append(('RNF', RandomForestClassifier(random_state=seedNum, n_jobs=n_jobs)))
train_models.append(('EXT', ExtraTreesClassifier(random_state=seedNum, n_jobs=n_jobs)))
In [42]:
# Train and cross-validate each model in turn
for name, model in train_models:
	if notifyStatus: status_notify("Algorithm "+name+" modeling has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
	startTimeModule = datetime.now()
	kfold = KFold(n_splits=n_folds, shuffle=True, random_state=seedNum)
	cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring, n_jobs=n_jobs, verbose=1)
	train_results.append(cv_results)
	train_model_names.append(name)
	train_metrics.append(cv_results.mean())
	print("%s: %f (%f)" % (name, cv_results.mean(), cv_results.std()))
	print(model)
	print ('Model training time:', (datetime.now() - startTimeModule), '\n')
	if notifyStatus: status_notify("Algorithm "+name+" modeling completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
print ('Average metrics ('+scoring+') from all models:',np.mean(train_metrics))
print ('Total training time for all models:',(datetime.now() - startTimeTraining))
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   5 out of   5 | elapsed:   14.6s finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
LDA: 0.990141 (0.002035)
LinearDiscriminantAnalysis()
Model training time: 0:00:14.794127 

[Parallel(n_jobs=2)]: Done   5 out of   5 | elapsed:   36.8s finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
CART: 0.848624 (0.005334)
DecisionTreeClassifier(random_state=888)
Model training time: 0:00:36.904367 

[Parallel(n_jobs=2)]: Done   5 out of   5 | elapsed:  2.2min finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
KNN: 0.988053 (0.001982)
KNeighborsClassifier(n_jobs=2)
Model training time: 0:02:10.108243 

[Parallel(n_jobs=2)]: Done   5 out of   5 | elapsed:  2.1min finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
BGT: 0.966102 (0.002819)
BaggingClassifier(n_jobs=2, random_state=888)
Model training time: 0:02:04.831001 

[Parallel(n_jobs=2)]: Done   5 out of   5 | elapsed:   50.5s finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
RNF: 0.994561 (0.000954)
RandomForestClassifier(n_jobs=2, random_state=888)
Model training time: 0:00:50.652678 

EXT: 0.995775 (0.000792)
ExtraTreesClassifier(n_jobs=2, random_state=888)
Model training time: 0:00:21.501344 

Average metrics (accuracy) from all models: 0.9638758371977462
Total training time for all models: 0:06:18.898987
[Parallel(n_jobs=2)]: Done   5 out of   5 | elapsed:   21.4s finished
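
The average metric printed above is the unweighted mean of the six per-algorithm cross-validation accuracies. A short check, using the rounded values copied from the output, also makes the ranking explicit:

```python
# Mean cross-validation accuracy per algorithm, copied from the output above
cv_accuracy = {
    "LDA": 0.990141,
    "CART": 0.848624,
    "KNN": 0.988053,
    "BGT": 0.966102,
    "RNF": 0.994561,
    "EXT": 0.995775,
}

# Unweighted mean across algorithms, and names sorted from best to worst
average = sum(cv_accuracy.values()) / len(cv_accuracy)
ranking = sorted(cv_accuracy, key=cv_accuracy.get, reverse=True)

print("Average accuracy: %.4f" % average)
print("Ranking:", ranking)
```

Extra Trees and Random Forest lead the ranking, which is why they are the two algorithms carried into the tuning step. CART is the only model below 95% and pulls the average down.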

4.b) Spot-checking baseline algorithms

In [43]:
fig = plt.figure(figsize=(16,12))
fig.suptitle('Algorithm Comparison - Spot Checking')
ax = fig.add_subplot(111)
plt.boxplot(train_results)
ax.set_xticklabels(train_model_names)
plt.show()
In [44]:
if notifyStatus: status_notify("Task 4 - Train and Evaluate Models completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

Task 5 - Fine-tune and Improve Models

In [45]:
if notifyStatus: status_notify("Task 5 - Fine-tune and Improve Models has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

5.a) Algorithm Tuning

In [46]:
# Set up the comparison array
tune_results = []
tune_model_names = []
In [47]:
# Tuning algorithm #1 - Extra Trees
startTimeModule = datetime.now()
if notifyStatus: status_notify("Algorithm #1 tuning has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

tune_model1 = ExtraTreesClassifier(random_state=seedNum, n_jobs=n_jobs)
tune_model_names.append('EXT')
paramGrid1 = dict(n_estimators=np.array([100, 200, 300, 400, 500]))

kfold = KFold(n_splits=n_folds, shuffle=True, random_state=seedNum)
grid1 = GridSearchCV(estimator=tune_model1, param_grid=paramGrid1, scoring=scoring, cv=kfold, n_jobs=n_jobs, verbose=1)
grid_result1 = grid1.fit(X_train, y_train)

print("Best: %f using %s" % (grid_result1.best_score_, grid_result1.best_params_))
tune_results.append(grid_result1.cv_results_['mean_test_score'])
means = grid_result1.cv_results_['mean_test_score']
stds = grid_result1.cv_results_['std_test_score']
params = grid_result1.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
print ('Model training time:',(datetime.now() - startTimeModule))
if notifyStatus: status_notify("Algorithm #1 tuning completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
Fitting 5 folds for each of 5 candidates, totalling 25 fits
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  25 out of  25 | elapsed:  4.5min finished
Best: 0.996843 using {'n_estimators': 500}
0.995775 (0.000792) with: {'n_estimators': 100}
0.996358 (0.001007) with: {'n_estimators': 200}
0.996649 (0.000471) with: {'n_estimators': 300}
0.996746 (0.000697) with: {'n_estimators': 400}
0.996843 (0.000752) with: {'n_estimators': 500}
Model training time: 0:05:08.781074
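
The log line "Fitting 5 folds for each of 5 candidates, totalling 25 fits" is the grid size multiplied by the number of folds. The sketch below reproduces that count and the number of held-out rows per fold for the 20,591 training rows, following the K-fold convention that the first `n % k` folds receive one extra sample. `fold_sizes` is a hypothetical helper, not a scikit-learn function.

```python
def fold_sizes(n_samples, n_splits):
    """Split n_samples into n_splits folds whose sizes differ by at most one."""
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]

n_candidates = 5   # n_estimators values in the grid
n_splits = 5       # folds reported in the log above
n_samples = 20591  # rows in X_train

sizes = fold_sizes(n_samples, n_splits)
print("total fits:", n_candidates * n_splits)
print("held-out rows per fold:", sizes)
```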
In [48]:
best_paramKey1 = list(grid_result1.best_params_.keys())[0]
best_paramValue1 = list(grid_result1.best_params_.values())[0]
print("Captured the best parameter for algorithm #1:", best_paramKey1, '=', best_paramValue1)
Captured the best parameter for algorithm #1: n_estimators = 500
In [49]:
# Tuning algorithm #2 - Random Forest
startTimeModule = datetime.now()
if notifyStatus: status_notify("Algorithm #2 tuning has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

tune_model2 = RandomForestClassifier(random_state=seedNum, n_jobs=n_jobs)
tune_model_names.append('RNF')
paramGrid2 = dict(n_estimators=np.array([100, 200, 300, 400, 500]))

kfold = KFold(n_splits=n_folds, shuffle=True, random_state=seedNum)
grid2 = GridSearchCV(estimator=tune_model2, param_grid=paramGrid2, scoring=scoring, cv=kfold, n_jobs=n_jobs, verbose=1)
grid_result2 = grid2.fit(X_train, y_train)

print("Best: %f using %s" % (grid_result2.best_score_, grid_result2.best_params_))
tune_results.append(grid_result2.cv_results_['mean_test_score'])
means = grid_result2.cv_results_['mean_test_score']
stds = grid_result2.cv_results_['std_test_score']
params = grid_result2.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
print ('Model training time:',(datetime.now() - startTimeModule))
if notifyStatus: status_notify("Algorithm #2 tuning completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
Fitting 5 folds for each of 5 candidates, totalling 25 fits
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  25 out of  25 | elapsed: 10.7min finished
Best: 0.996115 using {'n_estimators': 400}
0.994561 (0.000954) with: {'n_estimators': 100}
0.995144 (0.000827) with: {'n_estimators': 200}
0.995435 (0.000976) with: {'n_estimators': 300}
0.996115 (0.000594) with: {'n_estimators': 400}
0.996066 (0.000710) with: {'n_estimators': 500}
Model training time: 0:12:01.765355
In [50]:
best_paramKey2 = list(grid_result2.best_params_.keys())[0]
best_paramValue2 = list(grid_result2.best_params_.values())[0]
print("Captured the best parameter for algorithm #2:", best_paramKey2, '=', best_paramValue2)
Captured the best parameter for algorithm #2: n_estimators = 400
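
To put the two grid searches side by side, the snippet below recomputes the best setting for each algorithm and its gain over the 100-tree default, in percentage points. The scores are copied from the outputs above; `summarize` is a hypothetical helper.

```python
# n_estimators -> mean CV accuracy, copied from the two grid-search outputs
ext_scores = {100: 0.995775, 200: 0.996358, 300: 0.996649, 400: 0.996746, 500: 0.996843}
rnf_scores = {100: 0.994561, 200: 0.995144, 300: 0.995435, 400: 0.996115, 500: 0.996066}

def summarize(scores, baseline=100):
    """Return the best setting and its gain over the baseline, in percentage points."""
    best_n = max(scores, key=scores.get)
    gain_pp = (scores[best_n] - scores[baseline]) * 100
    return best_n, gain_pp

for name, scores in (("EXT", ext_scores), ("RNF", rnf_scores)):
    best_n, gain_pp = summarize(scores)
    print("%s: best n_estimators=%d, gain over 100 trees=%.3f points" % (name, best_n, gain_pp))
```

Both gains are in the range of 0.1 to 0.2 percentage points, which is of the same order as the fold-to-fold standard deviations printed next to each score, so the differences between settings are small.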

5.b) Compare Algorithms After Tuning

In [51]:
fig = plt.figure(figsize=(16,12))
fig.suptitle('Algorithm Comparison - Post Tuning')
ax = fig.add_subplot(111)
plt.boxplot(tune_results)
ax.set_xticklabels(tune_model_names)
plt.show()
In [52]:
if notifyStatus: status_notify("Task 5 - Fine-tune and Improve Models completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

Task 6 - Finalize Model and Present Analysis

In [53]:
if notifyStatus: status_notify("Task 6 - Finalize Model and Present Analysis has begun! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))

6.a) Validate the models using the validation dataset

In [54]:
validation_model1 = ExtraTreesClassifier(n_estimators=best_paramValue1, random_state=seedNum, n_jobs=n_jobs)
validation_model1.fit(X_train, y_train)
print(validation_model1)
predictions1 = validation_model1.predict(X_validation)
print('Accuracy Score:', accuracy_score(y_validation, predictions1))
print(confusion_matrix(y_validation, predictions1))
print(classification_report(y_validation, predictions1))
ExtraTreesClassifier(n_estimators=500, n_jobs=2, random_state=888)
Accuracy Score: 0.9983974358974359
[[282   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0]
 [  0 252   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0]
 [  0   0 286   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0]
 [  0   0   0 299   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0]
 [  0   0   0   0 239   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0]
 [  0   0   0   0   0 301   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0]
 [  0   0   0   0   0   0 272   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0]
 [  0   0   0   0   0   0   0 253   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0 289   0   0   0   1   0   1   0   0   0
    0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0 278   0   0   0   0   0   0   0   0
    0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0 309   0   0   0   0   0   0   0
    1   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0 263   1   0   0   0   0   0
    0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0 288   0   0   0   0   0
    0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0 299   0   0   0   0
    0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0 272   0   0   0
    0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 320   0   0
    0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 324   0
    0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0 299
    0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
  297   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0 290   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   1 268   1   0   0]
 [  0   0   0   0   0   0   0   0   0   3   0   0   0   0   0   0   0   0
    0   0   1 302   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0 291   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0 280]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       282
           1       1.00      1.00      1.00       252
           2       1.00      1.00      1.00       286
           3       1.00      1.00      1.00       299
           4       1.00      1.00      1.00       239
           5       1.00      1.00      1.00       301
           6       1.00      1.00      1.00       272
           7       1.00      1.00      1.00       253
           8       1.00      0.99      1.00       291
          10       0.99      1.00      0.99       278
          11       1.00      1.00      1.00       310
          12       1.00      1.00      1.00       264
          13       0.99      1.00      1.00       288
          14       1.00      1.00      1.00       299
          15       0.99      1.00      1.00       272
          16       1.00      1.00      1.00       320
          17       1.00      1.00      1.00       324
          18       1.00      1.00      1.00       300
          19       1.00      1.00      1.00       297
          20       1.00      1.00      1.00       290
          21       1.00      0.99      0.99       270
          22       1.00      0.99      0.99       306
          23       1.00      1.00      1.00       291
          24       1.00      1.00      1.00       280

    accuracy                           1.00      6864
   macro avg       1.00      1.00      1.00      6864
weighted avg       1.00      1.00      1.00      6864
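
The accuracy score can be recovered from the confusion matrix by counting its off-diagonal entries. The snippet below does this for the Extra Trees validation run, listing only the non-zero off-diagonal cells as (true label, predicted label, count) tuples read from the matrix above. Matrix positions are translated to labels, which skip 9.

```python
# Non-zero off-diagonal cells of the Extra Trees validation matrix,
# written as (true label, predicted label, count)
misclassified = [
    (8, 13, 1), (8, 15, 1), (11, 19, 1), (12, 13, 1), (18, 15, 1),
    (21, 20, 1), (21, 22, 1), (22, 10, 3), (22, 21, 1),
]
n_validation = 6864  # rows in X_validation

# Every sample that is not misclassified sits on the diagonal
n_errors = sum(count for _, _, count in misclassified)
accuracy = (n_validation - n_errors) / n_validation

print("errors:", n_errors)
print("accuracy:", accuracy)
```

Eleven errors out of 6,864 samples give the 0.9984 accuracy reported above, which is why every row of the classification report rounds to 0.99 or 1.00.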

In [55]:
validation_model2 = RandomForestClassifier(n_estimators=best_paramValue2, random_state=seedNum, n_jobs=n_jobs)
validation_model2.fit(X_train, y_train)
print(validation_model2)
predictions2 = validation_model2.predict(X_validation)
print('Accuracy Score:', accuracy_score(y_validation, predictions2))
print(confusion_matrix(y_validation, predictions2))
print(classification_report(y_validation, predictions2))
RandomForestClassifier(n_estimators=400, n_jobs=2, random_state=888)
Accuracy Score: 0.9978146853146853
[[282   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0]
 [  0 252   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0]
 [  0   0 286   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0]
 [  0   0   0 298   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   1   0]
 [  0   0   0   0 239   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0]
 [  0   0   0   0   0 301   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0]
 [  0   0   0   0   0   0 272   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0]
 [  0   0   0   0   0   0   0 253   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0 290   0   0   0   0   0   1   0   0   0
    0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0 277   0   0   0   0   0   0   1   0
    0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0 309   0   0   0   0   0   0   0
    1   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0 263   1   0   0   0   0   0
    0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0 287   0   0   0   0   1
    0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0 299   0   0   0   0
    0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0 271   0   1   0
    0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 320   0   0
    0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 324   0
    0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0 299
    0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
  297   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0
    0 289   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0
    0   0 267   2   0   0]
 [  0   0   0   0   0   0   0   0   0   3   0   0   0   0   0   0   0   0
    0   0   0 303   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0 291   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0 280]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       282
           1       1.00      1.00      1.00       252
           2       1.00      1.00      1.00       286
           3       1.00      1.00      1.00       299
           4       1.00      1.00      1.00       239
           5       1.00      1.00      1.00       301
           6       1.00      1.00      1.00       272
           7       1.00      1.00      1.00       253
           8       1.00      1.00      1.00       291
          10       0.99      1.00      0.99       278
          11       1.00      1.00      1.00       310
          12       1.00      1.00      1.00       264
          13       1.00      1.00      1.00       288
          14       1.00      1.00      1.00       299
          15       0.99      1.00      0.99       272
          16       1.00      1.00      1.00       320
          17       0.99      1.00      0.99       324
          18       1.00      1.00      1.00       300
          19       1.00      1.00      1.00       297
          20       1.00      1.00      1.00       290
          21       1.00      0.99      0.99       270
          22       0.99      0.99      0.99       306
          23       1.00      1.00      1.00       291
          24       1.00      1.00      1.00       280

    accuracy                           1.00      6864
   macro avg       1.00      1.00      1.00      6864
weighted avg       1.00      1.00      1.00      6864
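
The class labels in the reports above are integers. Per the dataset description in the introduction, labels 0-25 map one-to-one onto the letters A-Z, with no cases for 9 (J) or 25 (Z). A small sketch of that mapping, with `label_to_letter` as a hypothetical helper name:

```python
import string

def label_to_letter(label):
    """Map a Sign Language MNIST label (0-25) to its letter A-Z."""
    return string.ascii_uppercase[label]

# Labels that occur in the data: 0-24 without 9 (J); 25 (Z) never appears
labels_present = [label for label in range(25) if label != 9]
letters_present = [label_to_letter(label) for label in labels_present]

print(len(labels_present), "classes")
print({label: label_to_letter(label) for label in (10, 21, 22)})
```

Read this way, the three validation samples with true label 22 that both models predict as label 10 are W signs classified as K.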

6.b) Create a test model using all available data

In [56]:
# Combine the training and validation datasets into the complete dataset used to train the final model
X_complete = np.vstack((X_train, X_validation))
y_complete = np.concatenate((y_train, y_validation))
print("X_complete.shape: {} y_complete.shape: {}".format(X_complete.shape, y_complete.shape))
test_model = validation_model1.fit(X_complete, y_complete)
print(test_model)
X_complete.shape: (27455, 784) y_complete.shape: (27455,)
ExtraTreesClassifier(n_estimators=500, n_jobs=2, random_state=888)
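
scikit-learn estimators return `self` from `fit`, so `test_model` and `validation_model1` above refer to the same object, and refitting on the complete data replaces the model that was validated in 6.a. The toy class below only mimics that convention to make the aliasing visible; `ToyEstimator` is hypothetical and stands in for a real estimator.

```python
class ToyEstimator:
    """Minimal stand-in that follows the fit-returns-self convention."""

    def fit(self, X, y):
        self.n_samples_seen_ = len(X)  # remember how much data was used
        return self

validation_model = ToyEstimator()
validation_model.fit([[0], [1], [2]], [0, 1, 0])  # fit on the training split

# Refit on all data; the returned object is the same instance
test_model = validation_model.fit([[0], [1], [2], [3]], [0, 1, 0, 1])

print(test_model is validation_model)
print(validation_model.n_samples_seen_)
```

If the validated model needs to be kept alongside the final one, one option is to fit a fresh estimator instead (for example, one created with `sklearn.base.clone`).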

6.c) Load test dataset and measure predictions

In [57]:
dataset_path = 'https://dainesanalytics.com/datasets/kaggle-sign-language-mnist/sign_mnist_test.csv'
Xy_test = pd.read_csv(dataset_path, sep=',')

# Take a peek at the dataframe after import
Xy_test.head()
Out[57]:
   label  pixel1  pixel2  pixel3  pixel4  pixel5  pixel6  pixel7  pixel8  pixel9  ...  pixel775  pixel776  pixel777  pixel778  pixel779  pixel780  pixel781  pixel782  pixel783  pixel784
0      6     149     149     150     150     150     151     151     150     151  ...       138       148       127        89        82        96       106       112       120       107
1      5     126     128     131     132     133     134     135     135     136  ...        47       104       194       183       186       184       184       184       182       180
2     10      85      88      92      96     105     123     135     143     147  ...        68       166       242       227       230       227       226       225       224       222
3      0     203     205     207     206     207     209     210     209     210  ...       154       248       247       248       253       236       230       240       253       255
4      3     188     191     193     195     199     201     202     203     203  ...        26        40        64        48        29        46        49        46        46        53

5 rows × 785 columns

In [58]:
# Standardize the class column to the name of targetVar if required
Xy_test = Xy_test.rename(columns={'label': 'targetVar'})
In [59]:
X_test_df = Xy_test.iloc[:,1:totCol]
y_test_df = Xy_test.iloc[:,0]
print("Xy_test.shape: {} X_test_df.shape: {} y_test_df.shape: {}".format(Xy_test.shape, X_test_df.shape, y_test_df.shape))
Xy_test.shape: (7172, 785) X_test_df.shape: (7172, 784) y_test_df.shape: (7172,)
In [60]:
# Apply feature scaling and transformation to the test dataset
# Work on an explicit copy so the column assignment does not raise SettingWithCopyWarning
X_test_df = X_test_df.copy()
scaled_features = scaler.transform(X_test_df[columns_to_scale])
X_test_df[columns_to_scale] = scaled_features
print(X_test_df.head())
     pixel1    pixel2    pixel3    pixel4    pixel5    pixel6    pixel7    pixel8    pixel9   pixel10  ...  pixel775  pixel776  pixel777  \
0  0.083972  0.009040 -0.034334 -0.091690 -0.165984 -0.204641 -0.269983 -0.364861 -0.394901 -0.430646  ... -0.044395  0.011213 -0.405840   
1 -0.470883 -0.516347 -0.519197 -0.554650 -0.620165 -0.672326 -0.724063 -0.807103 -0.850995 -0.874938  ... -1.470988 -0.660691  0.633337   
2 -1.459974 -1.517084 -1.514442 -1.480571 -1.368228 -0.974945 -0.724063 -0.571240 -0.516526 -0.430646  ... -1.141774  0.286083  1.377822   
3  1.386677  1.410071  1.420254  1.348632  1.356860  1.390990  1.404437  1.374622  1.399065  1.378257  ...  0.206434  1.538268  1.455373   
4  1.024815  1.059813  1.062987  1.065712  1.143127  1.170903  1.177397  1.197725  1.186222  1.187846  ... -1.800201 -1.638006 -1.382977   

   pixel778  pixel779  pixel780  pixel781  pixel782  pixel783  pixel784  
0 -1.098285 -1.254178 -1.053709 -0.898606 -0.795042 -0.651346 -0.825028  
1  0.375849  0.376636  0.333221  0.328828  0.344140  0.323160  0.308575  
2  1.065869  1.066595  1.010925  0.989755  0.992841  0.983310  0.960784  
3  1.395197  1.427256  1.152770  1.052700  1.230171  1.439128  1.473235  
4 -1.741258 -2.085266 -1.841737 -1.795577 -1.839293 -1.814467 -1.663583  

[5 rows x 784 columns]
In [61]:
# Finalize the test dataset for model testing
X_test = X_test_df.to_numpy()
y_test = y_test_df.to_numpy()
print("X_test.shape: {} y_test.shape: {}".format(X_test.shape, y_test.shape))
X_test.shape: (7172, 784) y_test.shape: (7172,)
In [62]:
test_predictions = test_model.predict(X_test)
print('Accuracy Score:', accuracy_score(y_test, test_predictions))
print(confusion_matrix(y_test, test_predictions))
print(classification_report(y_test, test_predictions))
Accuracy Score: 0.8349135527049637
[[331   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0]
 [  0 404   0   2   0   0   0   0   0  26   0   0   0   0   0   0   0   0
    0   0   0   0   0   0]
 [  0   0 309   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0
    0   0   0   0   0   0]
 [  0   0   0 244   0   0   0   0   0   0   0   0   0   0   0   0   1   0
    0   0   0   0   0   0]
 [  0   0   0   0 490   0   0   0   0   0   0   1   0   0   0   0   0   7
    0   0   0   0   0   0]
 [  0   0  16   0   0 226   0   0   0   0   2   0   0   0   0   0   0   0
    3   0   0   0   0   0]
 [  0   0   0   0   0   0 302   1   0   0   0   0   0   0   2   0   0   0
   43   0   0   0   0   0]
 [  0   0   0   0   0   0  19 413   0   0   0   0   0   0   0   0   0   0
    4   0   0   0   0   0]
 [  3   0   0   0   0   0   0   0 234   0   0   6   3   0   0   0   0  21
    2   0   0   0   0  19]
 [  0   0   0   5   0   0   0   0  13 238   1   0   0   0   0   0  46   4
    2   6   0  16   0   0]
 [  0   0   0   0   0   0   0   0   0   0 209   0   0   0   0   0   0   0
    0   0   0   0   0   0]
 [  0   0   0   0  20   0   0   0   0   0   0 280  36   0   0   0   0  58
    0   0   0   0   0   0]
 [ 22   0   1   4  20   0   0   0   1   0   0  15 174   9   0   2   0  21
   20   2   0   0   0   0]
 [  0   0  16   0   0  13   0   0   0   0   0   0   0 216   0   0   0   0
    1   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0 347   0   0   0
    0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 164   0   0
    0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0  21   0   0   0   0   0  72   0
    0  51   0   0   0   0]
 [  0   0   0   0   8   0   0   0  10   0   0   6   0   0   0   0   0 222
    0   0   0   0   0   0]
 [  0   0   0   0   0   0   0   0   0   0   7   0   0   0   0   0   0   0
  208   0   0   0  33   0]
 [  0   9   0  13   0   0   0   0   0  15   0   0   0   0   0   0  53   0
    0 166   8   2   0   0]
 [  0   0   0   0   0   0   0   0   0   2   0   0   0   0   5   0  19   0
   16  23 204  77   0   0]
 [  0   0   0   0   0   0   0   0   0  20   0   0   0   0   0   0  32   0
    0  16   3 135   0   0]
 [  0   0   0   0   0   0   0   0   0  10   0   0   0   0   4   0   4  18
   20   0   0  24 187   0]
 [  0   0   0   0   0   0   0   0  16   5   0   0   0   0   0   0  46   1
   32   0  18   1   0 213]]
              precision    recall  f1-score   support

           0       0.93      1.00      0.96       331
           1       0.98      0.94      0.96       432
           2       0.90      1.00      0.95       310
           3       0.91      1.00      0.95       245
           4       0.91      0.98      0.95       498
           5       0.95      0.91      0.93       247
           6       0.94      0.87      0.90       348
           7       1.00      0.95      0.97       436
           8       0.85      0.81      0.83       288
          10       0.75      0.72      0.74       331
          11       0.87      1.00      0.93       209
          12       0.91      0.71      0.80       394
          13       0.82      0.60      0.69       291
          14       0.96      0.88      0.92       246
          15       0.97      1.00      0.98       347
          16       0.99      1.00      0.99       164
          17       0.26      0.50      0.35       144
          18       0.63      0.90      0.74       246
          19       0.59      0.84      0.69       248
          20       0.63      0.62      0.63       266
          21       0.88      0.59      0.70       346
          22       0.53      0.66      0.59       206
          23       0.85      0.70      0.77       267
          24       0.92      0.64      0.76       332

    accuracy                           0.83      7172
   macro avg       0.83      0.83      0.82      7172
weighted avg       0.86      0.83      0.84      7172
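
Label 17 (the letter R) has the weakest test scores in the report above. The snippet below recomputes its precision, recall, and F1 from the test confusion matrix, using the row for true label 17 and the column of samples predicted as 17. Matrix positions are translated to labels, which skip 9.

```python
# Test samples predicted as label 17, keyed by their true label
predicted_as_17 = {3: 1, 10: 46, 17: 72, 20: 53, 21: 19, 22: 32, 23: 4, 24: 46}
# Test samples whose true label is 17, keyed by the predicted label
true_17 = {11: 21, 17: 72, 20: 51}

true_positives = predicted_as_17[17]
precision = true_positives / sum(predicted_as_17.values())  # column-wise
recall = true_positives / sum(true_17.values())             # row-wise
f1 = 2 * precision * recall / (precision + recall)

print("precision=%.2f recall=%.2f f1=%.2f" % (precision, recall, f1))
```

Of the 273 test samples predicted as label 17, 201 belong to other classes, which explains the low precision even though half of the true label-17 samples are recovered.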

In [63]:
if notifyStatus: status_notify("Task 6 - Finalize Model and Present Analysis completed! "+datetime.now().strftime('%a %B %d, %Y %I:%M:%S %p'))
In [64]:
print ('Total time for the script:',(datetime.now() - startTimeScript))
Total time for the script: 0:36:15.140354